MATRIX CONCENTRATION INEQUALITIES 
VIA THE METHOD OF EXCHANGEABLE PAIRS 



LESTER MACKEY AND MICHAEL I. JORDAN* 
RICHARD Y. CHEN, BRENDAN FARRELL, AND JOEL A. TROPP†

Abstract. This paper derives exponential concentration inequalities and polynomial moment
inequalities for the spectral norm of a random matrix. The analysis requires a matrix extension of
the scalar concentration theory developed by Sourav Chatterjee using Stein's method of
exchangeable pairs. When applied to a sum of independent random matrices, this approach yields
matrix generalizations of the classical inequalities due to Hoeffding, Bernstein, Khintchine, and
Rosenthal. The same technique delivers bounds for sums of dependent random matrices and more
general matrix-valued functions of dependent random variables.

This paper is based on two independent manuscripts from mid-2011 that both applied the method
of exchangeable pairs to establish matrix concentration inequalities. One manuscript is by Mackey
and Jordan; the other is by Chen, Farrell, and Tropp. The authors have combined this research
into a single unified presentation, with equal contributions from both groups.

1. Introduction



Matrix concentration inequalities control the fluctuations of a random matrix about its mean.
At present, these results provide an effective method for studying sums of independent random
matrices and matrix martingales [Oli09, Tro11a, Tro11b, HKZ11]. They have been used to
streamline the analysis of structured random matrices in a range of applications, including
statistical estimation [Kol11], randomized linear algebra [Git11, CD11b], stability of
least-squares approximation [CDL11], combinatorial and robust optimization [So11, CSW11], matrix
completion [Gro11, Rec11, MTJ11], and random graph theory [Oli09]. These works comprise only a
small sample of the papers that rely on matrix concentration inequalities. Nevertheless, it
remains common to encounter new classes of random matrices that we cannot treat with the
available techniques.

The purpose of this paper is to lay the foundations of a new approach for analyzing structured 
random matrices. Our work is based on Chatterjee's technique for developing scalar concentration
inequalities [Cha07, Cha08] via Stein's method of exchangeable pairs [Ste72]. We extend this
argument to the matrix setting, where we use it to establish exponential concentration results
(Theorems 4.1 and 5.1) and polynomial moment inequalities (Theorem 7.1) for the spectral norm
of a random matrix.

To illustrate the power of this idea, we show that our general results imply several important
concentration bounds for a sum of independent, random, Hermitian matrices [LPP91, JX03, Tro11b].
We obtain a matrix Hoeffding inequality with optimal constants (Corollary 4.2) and a version of
the matrix Bernstein inequality (Corollary 5.2). Our techniques also yield concise proofs of the
matrix Khintchine inequality (Corollary 7.4) and the matrix Rosenthal inequality (Corollary 7.5).

The method of exchangeable pairs also applies to matrices constructed from dependent random
variables. We offer a hint of the prospects by establishing concentration results for several other
classes of random matrices. In Section 9, we consider sums of dependent matrices that satisfy
a conditional zero-mean property. In Section 10, we treat a broad class of combinatorial matrix
statistics. Finally, in Section 11, we analyze general matrix-valued functions that have a
self-reproducing property.

1.1. Notation and Preliminaries. The symbol $\|\cdot\|$ is reserved for the spectral norm, which
returns the largest singular value of a general complex matrix.

We write $\mathbb{M}^d$ for the algebra of all $d \times d$ complex matrices. The trace and normalized
trace of a square matrix are defined as

\[ \operatorname{tr} B := \sum_{j=1}^{d} b_{jj} \quad\text{and}\quad \bar{\operatorname{tr}}\, B := \frac{1}{d} \sum_{j=1}^{d} b_{jj} \quad\text{for } B \in \mathbb{M}^d. \]

We define the linear space $\mathbb{H}^d$ of Hermitian $d \times d$ matrices. All matrices in this paper are
Hermitian unless explicitly stated otherwise. The symbols $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ refer to the
algebraic maximum and minimum eigenvalues of a matrix $A \in \mathbb{H}^d$. For each interval
$I \subset \mathbb{R}$, we define the set of Hermitian matrices whose eigenvalues fall in that interval:

\[ \mathbb{H}^d(I) := \{ A \in \mathbb{H}^d : \lambda_{\max}(A), \lambda_{\min}(A) \in I \}. \]

The set $\mathbb{H}^d_+$ consists of all positive-semidefinite (psd) $d \times d$ matrices. Curly inequalities
refer to the semidefinite partial order on Hermitian matrices. For example, we write
$A \preccurlyeq B$ to signify that the matrix $B - A$ is psd.

We require operator convexity properties of the matrix square so often that we state them now.

\[ \left( \frac{A + B}{2} \right)^2 \preccurlyeq \frac{A^2 + B^2}{2} \quad\text{for all } A, B \in \mathbb{H}^d. \tag{1.1} \]

More generally, we have the operator Jensen inequality

\[ (\mathbb{E} X)^2 \preccurlyeq \mathbb{E} X^2, \tag{1.2} \]

valid for any random Hermitian matrix X, provided that $\mathbb{E} \|X\|^2 < \infty$. To verify this result,
simply expand the inequality $\mathbb{E} (X - \mathbb{E} X)^2 \succcurlyeq 0$. The operator Jensen inequality also
holds for conditional expectation, again provided that $\mathbb{E} \|X\|^2 < \infty$.
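For concreteness, the expansion behind (1.2) can be written out as follows. This is only a sketch of the one-line argument indicated above; it uses linearity of expectation and nothing else.

```latex
\begin{align*}
0 \preccurlyeq \mathbb{E}\,(X - \mathbb{E}X)^2
  &= \mathbb{E}\bigl[X^2 - X(\mathbb{E}X) - (\mathbb{E}X)X + (\mathbb{E}X)^2\bigr] \\
  &= \mathbb{E}X^2 - (\mathbb{E}X)^2 - (\mathbb{E}X)^2 + (\mathbb{E}X)^2
   = \mathbb{E}X^2 - (\mathbb{E}X)^2 .
\end{align*}
```

Rearranging the final expression gives $(\mathbb{E}X)^2 \preccurlyeq \mathbb{E}X^2$.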



2. Exchangeable Pairs of Random Matrices 

Our approach to studying random matrices is based on the method of exchangeable pairs, which 
originates in the work of Charles Stein [Ste72] on normal approximation for a sum of dependent
random variables. In this section, we explain how some central ideas from this theory extend to 
matrices. 



2.1. Matrix Stein Pairs. We begin with the definition of an exchangeable pair. 

Definition 2.1 (Exchangeable Pair). Let Z and Z' be random variables taking values in a Polish
space $\mathcal{Z}$. We say that (Z, Z') is an exchangeable pair if it has the same distribution as (Z', Z). In
particular, Z and Z' must share the same distribution. 

We can obtain a lot of information about the fluctuations of a random matrix X if we can 
construct a good exchangeable pair (X,X'). With this motivation in mind, let us introduce a 
special class of exchangeable pairs. 



A topological space is Polish if we can equip it with a metric to form a complete, separable metric space.

Definition 2.2 (Matrix Stein Pair). Let (Z, Z') be an exchangeable pair of random variables taking
values in a Polish space $\mathcal{Z}$, and let $\Psi : \mathcal{Z} \to \mathbb{H}^d$ be a measurable function.
Define the random Hermitian matrices

\[ X := \Psi(Z) \quad\text{and}\quad X' := \Psi(Z'). \]

We say that (X, X') is a matrix Stein pair if there is a constant $\alpha \in (0, 1]$ for which

\[ \mathbb{E}[X - X' \mid Z] = \alpha X \quad\text{almost surely.} \tag{2.1} \]

The constant $\alpha$ is called the scale factor of the pair. When discussing a matrix Stein pair (X, X'),
we always assume that $\mathbb{E} \|X\|^2 < \infty$.

A matrix Stein pair (X, X') has several useful properties. First, (X, X') always forms an
exchangeable pair. Second, it must be the case that $\mathbb{E} X = 0$. Indeed,

\[ \mathbb{E} X = \frac{1}{\alpha} \mathbb{E}\bigl[ \mathbb{E}[X - X' \mid Z] \bigr] = \frac{1}{\alpha} \mathbb{E}[X - X'] = 0 \]

because of the identity (2.1), the tower property of conditional expectation, and the exchangeability
of (X, X'). In Section 2.4, we construct a matrix Stein pair for a sum of centered, independent
random matrices. More sophisticated examples appear in Sections 9, 10, and 11.

2.2. The Method of Exchangeable Pairs. A well-chosen matrix Stein pair (X,X') provides a 
surprisingly powerful tool for studying the random matrix X. The technique depends on a simple 
but fundamental technical lemma. 

Lemma 2.3 (Method of Exchangeable Pairs). Suppose that $(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$ is a matrix
Stein pair with scale factor $\alpha$. Let $F : \mathbb{H}^d \to \mathbb{H}^d$ be a measurable function that satisfies
the regularity condition

\[ \mathbb{E} \|(X - X') \cdot F(X)\| < \infty. \tag{2.2} \]

Then

\[ \mathbb{E}[X \cdot F(X)] = \frac{1}{2\alpha} \mathbb{E}\bigl[ (X - X')(F(X) - F(X')) \bigr]. \tag{2.3} \]

To appreciate this result, recall that we can characterize the distribution of a random matrix by 
integrating it against a sufficiently large class of test functions. The additional randomness in the 
Stein pair furnishes an alternative expression for the expected product of X and the test function 
F. The identity (2.3) is valuable because it allows us to estimate this integral using the smoothness
properties of the function F and the discrepancy between X and X'.

Proof. Suppose (X, X') is a matrix Stein pair constructed from an auxiliary exchangeable pair
(Z, Z'). The defining property (2.1) of the Stein pair implies

\[ \alpha \cdot \mathbb{E}[X \cdot F(X)] = \mathbb{E}\bigl[ \mathbb{E}[X - X' \mid Z] \cdot F(X) \bigr] = \mathbb{E}[(X - X') F(X)]. \]

We have used the regularity condition (2.2) to invoke the pull-through property of conditional
expectation. Since (X, X') is an exchangeable pair,

\[ \alpha \cdot \mathbb{E}[X \cdot F(X)] = \mathbb{E}[(X - X') F(X)] = \mathbb{E}[(X' - X) F(X')] = -\mathbb{E}[(X - X') F(X')]. \]

The identity (2.3) follows when we average the two preceding displays. □

2.3. The Conditional Variance. To each matrix Stein pair (X, X'), we may associate a random
matrix called the conditional variance of X. The ultimate purpose of this paper is to argue that 
the spectral norm of X is unlikely to be large when the conditional variance is small. 

Definition 2.4 (Conditional Variance). Suppose that (X, X') is a matrix Stein pair, constructed
from an auxiliary exchangeable pair (Z, Z'). The conditional variance is the random matrix

\[ \Delta_X := \Delta_X(Z) := \frac{1}{2\alpha} \mathbb{E}\bigl[ (X - X')^2 \mid Z \bigr], \tag{2.4} \]

where $\alpha$ is the scale factor of the pair. We may take any version of the conditional expectation in
this definition.

The conditional variance $\Delta_X$ should be regarded as a stochastic estimate for the variance of the
random matrix X. Indeed,

\[ \mathbb{E}[\Delta_X] = \mathbb{E} X^2. \tag{2.5} \]

This identity follows immediately from Lemma 2.3 with the choice F(X) = X.
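To spell out the step, here is the computation with F(X) = X written in full. It is a sketch under the standing assumption $\mathbb{E}\|X\|^2 < \infty$, which gives the regularity condition (2.2) via the Cauchy-Schwarz inequality; the last line uses the tower property of conditional expectation.

```latex
\begin{align*}
\mathbb{E}\,X^2 = \mathbb{E}[X \cdot F(X)]
  &= \frac{1}{2\alpha}\,\mathbb{E}\bigl[(X - X')(X - X')\bigr] \\
  &= \mathbb{E}\Bigl[\frac{1}{2\alpha}\,\mathbb{E}\bigl[(X - X')^2 \,\big|\, Z\bigr]\Bigr]
   = \mathbb{E}[\Delta_X].
\end{align*}
```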

2.4. Example: A Sum of Independent Random Matrices. To make the definitions in this
section more vivid, we describe a simple but important example of a matrix Stein pair. Consider
an independent sequence $Z := (Y_1, \dots, Y_n)$ of random Hermitian matrices that satisfy
$\mathbb{E} Y_k = 0$ and $\mathbb{E} \|Y_k\|^2 < \infty$ for each k. Introduce the random series

\[ X := Y_1 + \dots + Y_n. \]

Let us explain how to build a good matrix Stein pair (X,X'). We need the exchangeable 
counterpart X' to have the same distribution as X , but it should also be close to X so that we can 
control the conditional variance. To achieve these goals, we construct X' by picking a summand 
from X at random and replacing it with a fresh copy. 

Formally, let $Y_k'$ be an independent copy of $Y_k$ for each index k, and draw a random index
K uniformly at random from $\{1, \dots, n\}$ and independently from everything else. Define the
random sequence $Z' := (Y_1, \dots, Y_{K-1}, Y_K', Y_{K+1}, \dots, Y_n)$. It is easy to check that (Z, Z')
forms an exchangeable pair. The random matrix

\[ X' := Y_1 + \dots + Y_{K-1} + Y_K' + Y_{K+1} + \dots + Y_n \]

is thus an exchangeable counterpart for X. To verify that (X, X') is a Stein pair, calculate that

\[ \mathbb{E}[X - X' \mid Z] = \mathbb{E}[Y_K - Y_K' \mid Z] = \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}[Y_k - Y_k' \mid Z] = \frac{1}{n} \sum_{k=1}^{n} Y_k = \frac{1}{n} X. \]

The third identity holds because $Y_k'$ is a centered random matrix that is independent from Z.
Therefore, (X, X') is a matrix Stein pair with scale factor $\alpha = n^{-1}$.
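Next, we compute the conditional variance.

\begin{align*}
\Delta_X &= \frac{n}{2} \cdot \mathbb{E}\bigl[ (X - X')^2 \mid Z \bigr] \\
&= \frac{n}{2} \cdot \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\bigl[ (Y_k - Y_k')^2 \mid Z \bigr] \\
&= \frac{1}{2} \sum_{k=1}^{n} \bigl[ Y_k^2 - Y_k (\mathbb{E} Y_k') - (\mathbb{E} Y_k') Y_k + \mathbb{E} (Y_k')^2 \bigr] \\
&= \frac{1}{2} \sum_{k=1}^{n} \bigl( Y_k^2 + \mathbb{E} Y_k^2 \bigr). \tag{2.6}
\end{align*}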

For the third relation, expand the square and invoke the pull-through property of conditional
expectation. We may drop the conditioning because $Y_k'$ is independent from Z. In the last line, we
apply the property that $Y_k'$ has the same distribution as $Y_k$.

The expression (2.6) shows that we can control the size of the conditional variance uniformly
if we can control the size of the individual summands. This example also teaches us that we may
use the symmetries of the distribution of the random matrix to construct a matrix Stein pair.
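The construction above can be checked exactly on a small instance. The sketch below is illustrative only: it assumes the special case $Y_k = \varepsilon_k A_k$ with independent Rademacher signs $\varepsilon_k$ and fixed Hermitian coefficients $A_k$ (our choice, not part of the text), and the names `PAULI` and `stein_pair_residuals` are hypothetical. In this case (2.6) reduces to $\Delta_X = \sum_k A_k^2$, and the code enumerates every sign pattern to confirm both the Stein identity with $\alpha = 1/n$ and this formula.

```python
import itertools
import numpy as np

# Hypothetical test data: the three Pauli matrices as coefficients A_k.
PAULI = [
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def stein_pair_residuals(coeffs):
    """Exact check of the Section 2.4 pair for Y_k = eps_k * A_k.

    Returns the largest entrywise deviation, over all sign patterns Z, of
      E[X - X' | Z]            from X / n,             and of
      (n/2) E[(X - X')^2 | Z]  from sum_k A_k^2.
    """
    n = len(coeffs)
    d = coeffs[0].shape[0]
    sum_sq = sum(A @ A for A in coeffs)
    worst_mean, worst_var = 0.0, 0.0
    for eps in itertools.product([-1.0, 1.0], repeat=n):
        X = sum(e * A for e, A in zip(eps, coeffs))
        cond_mean = np.zeros((d, d), dtype=complex)
        cond_sq = np.zeros((d, d), dtype=complex)
        # Average over the uniform index K and the fresh sign replacing eps_K.
        for k in range(n):
            for fresh in (-1.0, 1.0):
                diff = (eps[k] - fresh) * coeffs[k]   # X - X' for this outcome
                cond_mean += diff / (2 * n)
                cond_sq += (diff @ diff) / (2 * n)
        delta_X = (n / 2) * cond_sq                   # scale factor alpha = 1/n
        worst_mean = max(worst_mean, float(np.abs(cond_mean - X / n).max()))
        worst_var = max(worst_var, float(np.abs(delta_X - sum_sq).max()))
    return worst_mean, worst_var
```

Calling `stein_pair_residuals(PAULI)` should return two values at the level of floating-point rounding error.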



3. Exponential Moments and Eigenvalues of a Random Matrix 

Our main goal in this paper is to study the behavior of the extreme eigenvalues of a random
Hermitian matrix. In Section 3.2, we describe an approach to this problem that parallels the
classical Laplace transform method for scalar random variables. The adaptation to the matrix
setting leads us to consider the trace of the moment generating function (mgf) of a random matrix.
After presenting this background, we explain how the method of exchangeable pairs can be used
to control the growth of the trace mgf. This result, which appears in Section 3.5, is the key to our
exponential concentration bounds for random matrices.

3.1. Standard Matrix Functions. Before entering the discussion, recall that a standard matrix 
function is obtained by applying a real function to the eigenvalues of a Hermitian matrix. The 
book [Hig08] provides an excellent treatment of this concept. 

Definition 3.1 (Standard Matrix Function). Let $f : I \to \mathbb{R}$ be a function on an interval I of the
real line. Suppose that $A \in \mathbb{H}^d(I)$ has the eigenvalue decomposition
$A = Q \cdot \operatorname{diag}(\lambda_1, \dots, \lambda_d) \cdot Q^*$, where Q is a unitary matrix. Then

\[ f(A) := Q \cdot \operatorname{diag}\bigl( f(\lambda_1), \dots, f(\lambda_d) \bigr) \cdot Q^*. \]

The spectral mapping theorem states that the eigenvalues of f(A) are exactly the numbers
$f(\lambda)$, where $\lambda$ ranges over the eigenvalues of A. This fact follows immediately from
Definition 3.1.

When we apply a familiar scalar function to a Hermitian matrix, we are always referring to a
standard matrix function. For instance, |A| is the matrix absolute value, exp(A) is the matrix
exponential, and log(A) is the matrix logarithm. The latter is defined only for positive-definite
matrices.
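Definition 3.1 translates directly into a few lines of numerical code. The following is a minimal sketch that assumes NumPy's `numpy.linalg.eigh` for the eigenvalue decomposition; the helper name `standard_matrix_function` and the example matrix are ours.

```python
import numpy as np

def standard_matrix_function(f, A):
    """Apply a real function f to the eigenvalues of a Hermitian matrix A."""
    eigvals, Q = np.linalg.eigh(A)            # A = Q diag(eigvals) Q*
    # Scaling column j of Q by f(eigvals[j]) forms Q diag(f(eigvals)).
    return (Q * f(eigvals)) @ Q.conj().T

A = np.array([[2.0, 1.0], [1.0, 2.0]])        # eigenvalues 1 and 3
exp_A = standard_matrix_function(np.exp, A)   # matrix exponential
abs_shifted = standard_matrix_function(np.abs, A - 2.0 * np.eye(2))  # |[[0, 1], [1, 0]]|
```

Here `exp_A` has eigenvalues $e^1$ and $e^3$, in line with the spectral mapping theorem, and `abs_shifted` is the identity matrix because the shifted matrix has eigenvalues $\pm 1$.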

3.2. The Matrix Laplace Transform Method. Let us introduce a matrix variant of the classical
moment generating function. We learned this definition from Ahlswede-Winter [AW02, App.].

Definition 3.2 (Trace Mgf). Let X be a random Hermitian matrix. The (normalized) trace
moment generating function of X is defined as

\[ m(\theta) := m_X(\theta) := \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\theta X} \quad\text{for } \theta \in \mathbb{R}. \]

We admit the possibility that the expectation may not exist for all values of $\theta$.

Ahlswede and Winter [AW02, App.] had the insight that the classical Laplace transform method
could be extended to the matrix setting by replacing the classical mgf with the trace mgf. This
adaptation allows us to obtain concentration inequalities for the extreme eigenvalues of a random
Hermitian matrix using methods from matrix analysis. The following proposition distills results
from the papers [AW02, Oli10, Tro11b, CGT11].

Proposition 3.3 (Matrix Laplace Transform Method). Let $X \in \mathbb{H}^d$ be a random matrix with
normalized trace mgf $m(\theta) := \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\theta X}$. For each $t \in \mathbb{R}$,

\[ \mathbb{P}\{ \lambda_{\max}(X) \ge t \} \le d \cdot \inf_{\theta > 0} \exp\{ -\theta t + \log m(\theta) \}. \tag{3.1} \]

\[ \mathbb{P}\{ \lambda_{\min}(X) \le t \} \le d \cdot \inf_{\theta < 0} \exp\{ -\theta t + \log m(\theta) \}. \tag{3.2} \]

Furthermore,

\[ \mathbb{E} \lambda_{\max}(X) \le \inf_{\theta > 0} \frac{1}{\theta} \bigl[ \log d + \log m(\theta) \bigr]. \tag{3.3} \]

\[ \mathbb{E} \lambda_{\min}(X) \ge \sup_{\theta < 0} \frac{1}{\theta} \bigl[ \log d + \log m(\theta) \bigr]. \tag{3.4} \]

The estimates (3.3) and (3.4) for the expectations are usually sharp up to the logarithm of the
dimension. In many situations, the tail bounds (3.1) and (3.2) are reasonable for moderate t, but
they tend to overestimate the probability of a large deviation. Note that, in general, we cannot
dispense with the dimensional factor d. See [Tro11b, Sec. 4] for a detailed discussion of these issues.
Additional inequalities for the interior eigenvalues can be established using the minimax Laplace
transform method [GT11].



Proof. To establish (3.1), fix $\theta > 0$. Using Markov's inequality, we find that

\begin{align*}
\mathbb{P}\{ \lambda_{\max}(X) \ge t \} = \mathbb{P}\{ e^{\lambda_{\max}(\theta X)} \ge e^{\theta t} \} &\le e^{-\theta t} \cdot \mathbb{E}\, e^{\lambda_{\max}(\theta X)} \\
&= e^{-\theta t} \cdot \mathbb{E}\, \lambda_{\max}(e^{\theta X}) \le e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} e^{\theta X}.
\end{align*}

The third relation follows from the spectral mapping theorem. The final inequality holds because
the trace of a positive matrix dominates its maximum eigenvalue. Identify the normalized trace
mgf, and take the infimum over $\theta$ to complete the argument.

The proof of (3.2) parallels the proof of (3.1). For $\theta < 0$,

\[ \mathbb{P}\{ \lambda_{\min}(X) \le t \} = \mathbb{P}\{ \theta \lambda_{\min}(X) \ge \theta t \} = \mathbb{P}\{ \lambda_{\max}(\theta X) \ge \theta t \}. \]

We have used the property that $-\lambda_{\min}(A) = \lambda_{\max}(-A)$ for each Hermitian matrix A. The
remainder of the argument is the same as in the preceding paragraph.

For the expectation bound (3.3), fix $\theta > 0$. Jensen's inequality yields

\[ \mathbb{E} \lambda_{\max}(X) = \theta^{-1}\, \mathbb{E} \lambda_{\max}(\theta X) \le \theta^{-1} \log \mathbb{E}\, e^{\lambda_{\max}(\theta X)} \le \theta^{-1} \log \mathbb{E} \operatorname{tr} e^{\theta X}. \]

The justification is the same as above. Identify the normalized trace mgf, and take the infimum
over $\theta > 0$. Similar considerations yield (3.4). □
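As a sanity check on the expectation bound (3.3), one can compute both sides exactly for a tiny example. The sketch below assumes a Rademacher series $X = \sum_k \varepsilon_k A_k$ with a hypothetical pair of $2 \times 2$ coefficients (`COEFFS`), so that every expectation is a finite average over sign patterns; the function names are ours, and no infimum over $\theta$ is taken.

```python
import itertools
import numpy as np

# Hypothetical coefficients: X = eps_1 * diag(1, -1) + eps_2 * [[0, 1], [1, 0]].
COEFFS = [np.diag([1.0, -1.0]), np.array([[0.0, 1.0], [1.0, 0.0]])]

def rademacher_series_stats(coeffs, theta):
    """Exact normalized trace mgf m(theta) and E lambda_max(X) for X = sum_k eps_k A_k."""
    signs = list(itertools.product([-1.0, 1.0], repeat=len(coeffs)))
    mgf, mean_lmax = 0.0, 0.0
    for eps in signs:
        X = sum(e * A for e, A in zip(eps, coeffs))
        eigvals = np.linalg.eigvalsh(X)                     # ascending order
        mgf += np.exp(theta * eigvals).mean() / len(signs)  # normalized trace of exp(theta X)
        mean_lmax += eigvals[-1] / len(signs)
    return mgf, mean_lmax

def laplace_expectation_bound(coeffs, theta):
    """Right-hand side of (3.3) evaluated at one fixed theta > 0."""
    d = coeffs[0].shape[0]
    mgf, _ = rademacher_series_stats(coeffs, theta)
    return (np.log(d) + np.log(mgf)) / theta
```

For these coefficients every realization of X has eigenvalues $\pm\sqrt{2}$, so $\mathbb{E}\lambda_{\max}(X) = \sqrt{2}$, and `laplace_expectation_bound(COEFFS, theta)` stays above this value for each $\theta > 0$.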



3.3. Studying the Trace Mgf with Exchangeable Pairs. The technical difficulty in the matrix
Laplace transform method arises because we need to estimate the trace mgf. Previous authors
have applied deep results from matrix analysis to accomplish this bound: the Golden-Thompson
inequality is central to [AW02, Oli09, Oli10], while Lieb's theorem [Lie73, Thm. 6] animates
[Tro11a, Tro11b, HKZ11].

In this paper, we develop a fundamentally different technique for studying the trace mgf. The
main idea is to control the growth of the trace mgf by bounding its derivative. To see why we have
adopted this strategy, consider a random Hermitian matrix X, and observe that the derivative of
its trace mgf can be written as

\[ m'(\theta) = \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ X e^{\theta X} \bigr] \]

under appropriate regularity conditions. This expression has just the form that we need to invoke
the method of exchangeable pairs, Lemma 2.3, with $F(X) = e^{\theta X}$. We obtain

\[ m'(\theta) = \frac{1}{2\alpha} \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ (X - X') (e^{\theta X} - e^{\theta X'}) \bigr]. \tag{3.5} \]

This formula strongly suggests that we should apply a mean value theorem to control the derivative;
we establish the result that we need in Section 3.4 below. Ultimately, this argument leads to a
differential inequality for $m'(\theta)$, which we can integrate to obtain an estimate for $m(\theta)$.

The technique of bounding the derivative of an mgf lies at the heart of the log-Sobolev method
for studying concentration phenomena [Led01, Ch. 5]. Recently, Chatterjee [Cha07, Cha08]
demonstrated that the method of exchangeable pairs provides another way to control the derivative
of an mgf. Our arguments closely follow the pattern set by Chatterjee; the novelty inheres in the
extension of these ideas to the matrix setting and the striking applications that this extension
permits.



3.4. The Mean Value Trace Inequality. To bound the expression (3.5) for the derivative of
the trace mgf, we need a matrix generalization of the mean value theorem for a function whose
derivative is convex. We state the result in full generality because it plays a role later.

Lemma 3.4 (Mean Value Trace Inequality). Let I be an interval of the real line. Suppose that
$g : I \to \mathbb{R}$ is a weakly increasing function and that $h : I \to \mathbb{R}$ is a function whose
derivative $h'$ is convex. For all matrices $A, B \in \mathbb{H}^d(I)$, it holds that

\[ \bar{\operatorname{tr}} \bigl[ (g(A) - g(B)) \cdot (h(A) - h(B)) \bigr] \le \frac{1}{2}\, \bar{\operatorname{tr}} \bigl[ (g(A) - g(B)) \cdot (A - B) \cdot (h'(A) + h'(B)) \bigr]. \]

When $h'$ is concave, the inequality is reversed. The same results hold for the standard trace.

To prove Lemma 3.4, we need a standard trace inequality [Pet94, Prop. 3], which is an easy
consequence of the spectral theorem for Hermitian matrices.

Proposition 3.5 (Generalized Klein Inequality). Let $u_1, \dots, u_n$ and $v_1, \dots, v_n$ be real-valued
functions on an interval I of the real line. Suppose that

\[ \sum_{k} u_k(a)\, v_k(b) \ge 0 \quad\text{for all } a, b \in I. \tag{3.6} \]

Then

\[ \bar{\operatorname{tr}} \Bigl[ \sum_{k} u_k(A)\, v_k(B) \Bigr] \ge 0 \quad\text{for all } A, B \in \mathbb{H}^d(I). \]

With the generalized Klein inequality at hand, we can establish Lemma 3.4 by developing the
appropriate scalar inequality.



Proof of Lemma 3.4. Fix $a, b \in I$. Since g is weakly increasing, $(g(a) - g(b)) \cdot (a - b) \ge 0$. The
fundamental theorem of calculus and the convexity of $h'$ yield the estimate

\begin{align*}
(g(a) - g(b)) \cdot (h(a) - h(b)) &= (g(a) - g(b)) \cdot (a - b) \int_0^1 h'(\tau a + (1 - \tau) b) \, d\tau \\
&\le (g(a) - g(b)) \cdot (a - b) \int_0^1 \bigl[ \tau \cdot h'(a) + (1 - \tau) \cdot h'(b) \bigr] \, d\tau \\
&= \frac{1}{2} \bigl[ (g(a) - g(b)) \cdot (a - b) \cdot (h'(a) + h'(b)) \bigr]. \tag{3.7}
\end{align*}

The inequality is reversed when $h'$ is concave.

The bound (3.7) can be written in the form (3.6) by expanding the products and collecting terms
depending on a into functions $u_k(a)$ and terms depending on b into functions $v_k(b)$. Proposition 3.5
then delivers a trace inequality, which can be massaged into the desired form using the cyclicity of
the trace and the fact that standard functions of the same matrix commute. We omit the algebraic
details. □
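Lemma 3.4 is easy to probe numerically in the case used later, namely $g(s) = s$ and $h(s) = e^{\theta s}$ with $\theta > 0$, for which $h'(s) = \theta e^{\theta s}$ is convex. The sketch below is a spot check rather than a proof; the helper names and the random test matrices are our own assumptions.

```python
import numpy as np

def _hermitian_exp(M, theta):
    """Standard matrix function exp(theta * M) for Hermitian M."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.exp(theta * w)) @ Q.conj().T

def mean_value_trace_gap(A, B, theta=1.0):
    """Right side minus left side of Lemma 3.4 for g(s) = s, h(s) = exp(theta * s)."""
    d = A.shape[0]
    eA, eB = _hermitian_exp(A, theta), _hermitian_exp(B, theta)
    diff = A - B
    lhs = np.trace(diff @ (eA - eB)).real / d
    # h'(s) = theta * exp(theta * s), so h'(A) + h'(B) = theta * (eA + eB).
    rhs = 0.5 * theta * np.trace(diff @ diff @ (eA + eB)).real / d
    return rhs - lhs

def random_hermitian(rng, d):
    """Hypothetical test input: a random d x d Hermitian matrix."""
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (G + G.conj().T) / 2
```

With $\theta > 0$ the returned gap should be nonnegative, up to rounding error, for every pair of Hermitian inputs.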

Remark 3.6. We must warn the reader that the proof of Lemma 3.4 succeeds because the trace
contains a product of three terms involving two matrices. The obstacle to proving more general
results is that we cannot reorganize expressions like tr(ABAB) and tr(ABC) at will.

3.5. Bounding the Derivative of the Trace Mgf. The central result in this section applies the
method of exchangeable pairs and the mean value trace inequality to bound the derivative of
the trace mgf in terms of the conditional variance. This is the most important step in our theory
on the exponential concentration of random matrices.

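Lemma 3.7 (The Derivative of the Trace Mgf). Suppose that $(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$ is a matrix
Stein pair, and assume that X is almost surely bounded in norm. Define the normalized trace mgf
$m(\theta) := \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\theta X}$. Then

\[ m'(\theta) \le \theta \cdot \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ \Delta_X e^{\theta X} \bigr] \quad\text{when } \theta \ge 0. \tag{3.8} \]

\[ m'(\theta) \ge \theta \cdot \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ \Delta_X e^{\theta X} \bigr] \quad\text{when } \theta \le 0. \tag{3.9} \]

The conditional variance $\Delta_X$ is defined in (2.4).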

Proof. Let us begin with the expression for the derivative of the trace mgf:

\[ m'(\theta) = \mathbb{E}\, \bar{\operatorname{tr}} \Bigl[ \frac{d}{d\theta} e^{\theta X} \Bigr] = \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ X e^{\theta X} \bigr]. \tag{3.10} \]

We can move the derivative inside the expectation because of the dominated convergence theorem
and the boundedness of X.

Now, we apply the method of exchangeable pairs, Lemma 2.3, with $F(X) = e^{\theta X}$ to identify an
alternative representation of the derivative (3.10):

\[ m'(\theta) = \frac{1}{2\alpha} \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ (X - X')(e^{\theta X} - e^{\theta X'}) \bigr]. \tag{3.11} \]

We have used the boundedness of X to verify the regularity condition (2.2).

The expression (3.11) is perfectly suited for an application of the mean value trace inequality,
Lemma 3.4. First, assume that $\theta \ge 0$, and consider the function $h : s \mapsto e^{\theta s}$. The derivative
$h' : s \mapsto \theta e^{\theta s}$ is convex, so Lemma 3.4, applied with $g : s \mapsto s$, implies that

\begin{align*}
m'(\theta) &\le \frac{\theta}{4\alpha} \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ (X - X')^2 \cdot (e^{\theta X} + e^{\theta X'}) \bigr] \\
&= \frac{\theta}{2\alpha} \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ (X - X')^2 \cdot e^{\theta X} \bigr] \\
&= \theta \cdot \mathbb{E}\, \bar{\operatorname{tr}} \Bigl[ \frac{1}{2\alpha} \mathbb{E}\bigl[ (X - X')^2 \mid Z \bigr] \cdot e^{\theta X} \Bigr].
\end{align*}

The second line follows from the fact that (X, X') is an exchangeable pair. In the last line, we have
used the boundedness of X and X' to invoke the pull-through property of conditional expectation.
Identify the conditional variance $\Delta_X$, defined in (2.4), to complete the argument.

The result for $\theta \le 0$ follows from an analogous argument. In this case, we simply observe that
the derivative of the function $h : s \mapsto e^{\theta s}$ is now concave, so the mean value trace inequality,
Lemma 3.4, produces a lower bound. The remaining steps are identical. □

Remark 3.8 (Regularity Conditions). To simplify the presentation, we have imposed a boundedness
assumption in Lemma 3.7. All the examples we discuss satisfy this requirement. When X is
unbounded, Lemma 3.7 still holds provided that X meets an integrability condition.

4. Exponential Concentration for Bounded Random Matrices 

We are now prepared to establish exponential concentration inequalities. Our first major result 
demonstrates that an almost-sure bound for the conditional variance yields exponential tail bounds 
for the extreme eigenvalues of a random Hermitian matrix. We can also obtain estimates for the 
expectation of the extreme eigenvalues.

Theorem 4.1 (Concentration for Bounded Random Matrices). Consider a matrix Stein pair
$(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$. Suppose there exist nonnegative constants c, v for which the
conditional variance (2.4) of the pair satisfies

\[ \Delta_X \preccurlyeq cX + vI \quad\text{almost surely.} \tag{4.1} \]

Then, for all $t \ge 0$,

\begin{align*}
\mathbb{P}\{ \lambda_{\min}(X) \le -t \} &\le d \cdot \exp\Bigl\{ \frac{-t^2}{2v} \Bigr\}, \\
\mathbb{P}\{ \lambda_{\max}(X) \ge t \} &\le d \cdot \exp\Bigl\{ -\frac{t}{c} + \frac{v}{c^2} \log\Bigl( 1 + \frac{ct}{v} \Bigr) \Bigr\} \\
&\le d \cdot \exp\Bigl\{ \frac{-t^2}{2v + 2ct} \Bigr\}.
\end{align*}

Furthermore,

\begin{align*}
\mathbb{E} \lambda_{\min}(X) &\ge -\sqrt{2v \log d}, \\
\mathbb{E} \lambda_{\max}(X) &\le \sqrt{2v \log d} + c \log d.
\end{align*}

This result may be viewed as a matrix analogue of Chatterjee's concentration inequality for scalar
random variables [Cha07, Thm. 1.5(ii)]. The proof of Theorem 4.1 appears below in Section 4.2.
Before we present the argument, let us explain how the result provides a short proof of a
Hoeffding-type inequality for matrices.
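To get a feeling for the size of these bounds, the formulas of Theorem 4.1 can be evaluated directly. The helper functions below are a sketch under the assumption that the constants c, v and the dimension d are already known; the function names and the labels `sharp` and `simplified` are ours, and the upper-tail helper requires c > 0 and v > 0 so that every expression is defined.

```python
import math

def upper_tail_bounds(t, c, v, d):
    """Both bounds on P{lambda_max(X) >= t} from Theorem 4.1 (needs c > 0, v > 0)."""
    sharp = d * math.exp(-t / c + (v / c ** 2) * math.log1p(c * t / v))
    simplified = d * math.exp(-t ** 2 / (2 * v + 2 * c * t))
    return sharp, simplified

def lower_tail_bound(t, v, d):
    """Bound on P{lambda_min(X) <= -t} from Theorem 4.1 (needs v > 0)."""
    return d * math.exp(-t ** 2 / (2 * v))

def expectation_bounds(c, v, d):
    """Lower bound on E lambda_min(X) and upper bound on E lambda_max(X)."""
    root = math.sqrt(2 * v * math.log(d))
    return -root, root + c * math.log(d)
```

For example, `upper_tail_bounds(2.0, 1.0, 1.0, 10)` returns the pair $(30 e^{-2},\, 10 e^{-2/3})$, and the first entry never exceeds the second. Note that both quantities are only informative once they drop below 1.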



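4.1. Application: Matrix Hoeffding Inequality. Theorem 4.1 yields an extension of Hoeffding's
inequality [Hoe63] that holds for an independent sum of bounded random matrices.

Corollary 4.2 (Matrix Hoeffding). Consider a finite sequence $(Y_k)_{k \ge 1}$ of independent random
matrices in $\mathbb{H}^d$ and a finite sequence $(A_k)_{k \ge 1}$ of deterministic matrices in $\mathbb{H}^d$. Assume that

\[ \mathbb{E} Y_k = 0 \quad\text{and}\quad Y_k^2 \preccurlyeq A_k^2 \quad\text{for each index } k. \]

Then, for all $t \ge 0$,

\[ \mathbb{P}\Bigl\{ \lambda_{\max}\Bigl( \sum_k Y_k \Bigr) \ge t \Bigr\} \le d \cdot e^{-t^2 / 2\sigma^2} \quad\text{where}\quad \sigma^2 := \frac{1}{2} \Bigl\| \sum_k \bigl( A_k^2 + \mathbb{E} Y_k^2 \bigr) \Bigr\|. \]

Furthermore,

\[ \mathbb{E} \lambda_{\max}\Bigl( \sum_k Y_k \Bigr) \le \sigma \sqrt{2 \log d}. \]

Proof. Let $X = \sum_k Y_k$. Since X is a sum of centered, independent random matrices, we can
use the matrix Stein pair constructed in Section 2.4. According to (2.6), the conditional variance
satisfies

\[ \Delta_X = \frac{1}{2} \sum_k \bigl( Y_k^2 + \mathbb{E} Y_k^2 \bigr) \preccurlyeq \sigma^2 I \]

because $Y_k^2 \preccurlyeq A_k^2$. Invoke Theorem 4.1 with c = 0 and $v = \sigma^2$ to complete the bound. □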
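The quantities in Corollary 4.2 are straightforward to compute in the Rademacher case $Y_k = \varepsilon_k A_k$, where $Y_k^2 = A_k^2$ and hence $\sigma^2 = \|\sum_k A_k^2\|$. The following sketch assumes this special case; the function name `matrix_hoeffding_bounds` and the example coefficients `COEFFS` are hypothetical.

```python
import numpy as np

# Hypothetical coefficients for Y_k = eps_k * A_k with Rademacher signs eps_k.
COEFFS = [np.diag([1.0, -1.0]), np.array([[0.0, 1.0], [1.0, 0.0]])]

def matrix_hoeffding_bounds(coeffs, t):
    """Variance parameter, tail bound, and expectation bound from Corollary 4.2."""
    d = coeffs[0].shape[0]
    sum_sq = sum(A @ A for A in coeffs)
    # E Y_k^2 = A_k^2 here, so sum_k (A_k^2 + E Y_k^2) = 2 * sum_k A_k^2.
    sigma_sq = 0.5 * np.linalg.norm(2.0 * sum_sq, ord=2)   # spectral norm
    tail = d * np.exp(-t ** 2 / (2.0 * sigma_sq))
    mean = np.sqrt(2.0 * sigma_sq * np.log(d))
    return sigma_sq, tail, mean
```

For `COEFFS` we get $\sigma^2 = 2$ and the expectation bound $\sqrt{4 \log 2} \approx 1.67$, which sits above the exact value $\mathbb{E}\lambda_{\max}(X) = \sqrt{2}$ for this example.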

In the scalar setting d = 1, Corollary 4.2 reduces to an inequality of Chatterjee [Cha07,
Sec. 1.5], which itself represents an improvement over the classical scalar Hoeffding bound. In turn,
Corollary 4.2 improves upon the matrix Hoeffding inequality of [Tro11b, Thm. 1.3] in two ways.
First, we have reduced the constant in the exponent to its optimal value 1/2. Second, we have
decreased the size of the variance measure because $\sigma^2 \le \| \sum_k A_k^2 \|$.

4.2. Proof of Theorem 4.1: Exponential Concentration. Suppose that (X, X') is a matrix
Stein pair constructed from an auxiliary exchangeable pair (Z, Z'). Our aim is to bound the
normalized trace mgf

\[ m(\theta) := \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\theta X} \quad\text{for } \theta \in \mathbb{R}. \tag{4.2} \]

The basic strategy is to develop a differential inequality, which we integrate to control $m(\theta)$ itself.
Once these estimates are in place, the matrix Laplace transform method, Proposition 3.3, furnishes
probability inequalities for the extreme eigenvalues of X.

The following result summarizes our bounds for the trace mgf $m(\theta)$.

Lemma 4.3 (Trace Mgf Estimates for Bounded Random Matrices). Let (X, X') be a matrix Stein
pair, and suppose there exist nonnegative constants c, v for which

\[ \Delta_X \preccurlyeq cX + vI \quad\text{almost surely.} \tag{4.3} \]

Then the normalized trace mgf $m(\theta) := \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\theta X}$ satisfies the bounds

\begin{align*}
\log m(\theta) &\le \frac{v\theta^2}{2} && \text{when } \theta \le 0. \tag{4.4} \\
\log m(\theta) &\le \frac{v}{c^2} \Bigl[ \log\Bigl( \frac{1}{1 - c\theta} \Bigr) - c\theta \Bigr] \tag{4.5} \\
&\le \frac{v\theta^2}{2(1 - c\theta)} && \text{when } 0 \le \theta < 1/c. \tag{4.6}
\end{align*}

We establish Lemma 4.3 in Section 4.2.1 et seq. In Section 4.2.4, we finish the proof of
Theorem 4.1 by combining these bounds with the matrix Laplace transform method.

4.2.1. Boundedness of the Random Matrix. First, we confirm that the random matrix X is almost
surely bounded under the hypothesis (4.3) on the conditional variance $\Delta_X$. Recall the
definition (2.4) of the conditional variance, and compute that

\[ \Delta_X = \frac{1}{2\alpha} \mathbb{E}\bigl[ (X - X')^2 \mid Z \bigr] \succcurlyeq \frac{1}{2\alpha} \bigl( \mathbb{E}[X - X' \mid Z] \bigr)^2 = \frac{1}{2\alpha} (\alpha X)^2 = \frac{\alpha}{2} X^2. \]

The semidefinite bound is the operator Jensen inequality (1.2), applied conditionally. The third
relation follows from the definition (2.1) of a matrix Stein pair. Owing to the assumption (4.3),
we reach the quadratic inequality $\frac{1}{2} \alpha X^2 \preccurlyeq cX + vI$. The scale factor $\alpha$ is positive,
so we may conclude that the eigenvalues of X are almost surely restricted to a bounded interval.

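4.2.2. Differential Inequalities for the Trace Mgf. The fact that X is almost surely bounded ensures
that the derivative of the trace mgf has the form

\[ m'(\theta) = \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ X e^{\theta X} \bigr] \quad\text{for } \theta \in \mathbb{R}. \tag{4.7} \]

To bound the derivative, we combine Lemma 3.7 with the assumed inequality (4.3) for the
conditional variance. For $\theta \ge 0$, we obtain

\begin{align*}
m'(\theta) &\le \theta \cdot \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ \Delta_X e^{\theta X} \bigr] \\
&\le \theta \cdot \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ (cX + vI)\, e^{\theta X} \bigr] \\
&= c\theta \cdot \mathbb{E}\, \bar{\operatorname{tr}} \bigl[ X e^{\theta X} \bigr] + v\theta \cdot \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\theta X} \\
&= c\theta \cdot m'(\theta) + v\theta \cdot m(\theta).
\end{align*}

The second relation relies on the fact that the matrix $e^{\theta X}$ is positive. In the last line, we have
identified the trace mgf (4.2) and its derivative (4.7). For $\theta \le 0$, the same argument yields a lower
bound

\[ m'(\theta) \ge c\theta \cdot m'(\theta) + v\theta \cdot m(\theta). \]

Rearrange these inequalities to isolate the log-derivative $m'(\theta)/m(\theta)$ of the trace mgf. We reach

\[ \frac{d}{d\theta} \log m(\theta) \le \frac{v\theta}{1 - c\theta} \quad\text{for } 0 \le \theta < 1/c, \text{ and} \tag{4.8} \]

\[ \frac{d}{d\theta} \log m(\theta) \ge \frac{v\theta}{1 - c\theta} \quad\text{for } \theta \le 0. \tag{4.9} \]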

4.2.3. Solving the Differential Inequalities. Observe that

\[ \log m(0) = \log \bar{\operatorname{tr}}\, e^{0} = \log \bar{\operatorname{tr}}\, I = \log 1 = 0. \tag{4.10} \]

Therefore, we may integrate the differential inequalities (4.8) and (4.9), starting at zero, to obtain
bounds on $\log m(\theta)$ elsewhere.

First, assume that $0 \le \theta < 1/c$. In view of (4.10), the fundamental theorem of calculus and the
differential inequality (4.8) imply that

\[ \log m(\theta) = \int_0^\theta \frac{d}{ds} \log m(s) \, ds \le \int_0^\theta \frac{vs}{1 - cs} \, ds = -\frac{v}{c^2} \bigl( c\theta + \log(1 - c\theta) \bigr). \]

We can develop a weaker inequality by making a further approximation within the integral:

\[ \log m(\theta) \le \int_0^\theta \frac{vs}{1 - cs} \, ds \le \int_0^\theta \frac{vs}{1 - c\theta} \, ds = \frac{v\theta^2}{2(1 - c\theta)}. \]

These inequalities are the trace mgf estimates (4.5) and (4.6) appearing in Lemma 4.3.

Next, assume that $\theta \le 0$. In this case, the differential inequality (4.9) yields

\[ -\log m(\theta) = \int_\theta^0 \frac{d}{ds} \log m(s) \, ds \ge \int_\theta^0 \frac{vs}{1 - cs} \, ds \ge \int_\theta^0 vs \, ds = -\frac{v\theta^2}{2}. \]

This calculation delivers the trace mgf bound (4.4), and the proof of Lemma 4.3 is complete.
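For readers who wish to verify the integral for $0 \le \theta < 1/c$ by hand, the following sketch records one route, assuming c > 0. The first line uses a partial-fraction split of the integrand; the second line, stated with the shorthand $x = c\theta$ (our notation), compares the resulting expression with the weaker bound through the power series of the logarithm.

```latex
\begin{align*}
\int_0^\theta \frac{vs}{1 - cs}\,ds
  &= \frac{v}{c}\int_0^\theta \Bigl(\frac{1}{1 - cs} - 1\Bigr)\,ds
   = \frac{v}{c}\Bigl[-\frac{1}{c}\log(1 - c\theta) - \theta\Bigr]
   = -\frac{v}{c^2}\bigl(c\theta + \log(1 - c\theta)\bigr), \\
-\bigl(x + \log(1 - x)\bigr)
  &= \sum_{j \ge 2}\frac{x^j}{j}
   \le \frac{x^2}{2}\sum_{j \ge 0} x^j
   = \frac{x^2}{2(1 - x)}
   \quad\text{for } 0 \le x < 1.
\end{align*}
```

Multiplying the second line by $v/c^2$ and setting $x = c\theta$ recovers the bound $v\theta^2 / (2(1 - c\theta))$.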



4.2.4. The Matrix Laplace Transform Argument. With Lemma 4.3 at hand, we quickly finish the
proof of Theorem 4.1. First, let us establish probability inequalities for the maximum eigenvalue.
The Laplace transform bound (3.1) and the trace mgf estimate (4.5) together yield

\begin{align*}
\mathbb{P}\{ \lambda_{\max}(X) \ge t \} &\le d \cdot \inf_{0 < \theta < 1/c} \exp\Bigl\{ -\theta t - \frac{v}{c^2} \bigl( c\theta + \log(1 - c\theta) \bigr) \Bigr\} \\
&\le d \cdot \exp\Bigl\{ -\frac{t}{c} + \frac{v}{c^2} \log\Bigl( 1 + \frac{ct}{v} \Bigr) \Bigr\}.
\end{align*}

The second relation follows when we choose $\theta = t/(v + ct)$. Similarly, the trace mgf bound (4.6)
delivers

\[ \mathbb{P}\{ \lambda_{\max}(X) \ge t \} \le d \cdot \inf_{0 < \theta < 1/c} \exp\Bigl\{ -\theta t + \frac{v\theta^2}{2(1 - c\theta)} \Bigr\} \le d \cdot \exp\Bigl\{ \frac{-t^2}{2v + 2ct} \Bigr\}, \]

where we have selected $\theta = t/(v + ct)$ again. To control the expectation of the maximum eigenvalue,
we invoke the Laplace transform bound (3.3) and the trace mgf bound (4.6) to see that

\[ \mathbb{E} \lambda_{\max}(X) \le \inf_{0 < \theta < 1/c} \frac{1}{\theta} \Bigl[ \log d + \frac{v\theta^2}{2(1 - c\theta)} \Bigr] = \sqrt{2v \log d} + c \log d. \]

The second relation can be verified using a computer algebra system.
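The closed form can also be checked by hand. The sketch below relies on the substitution $\theta = a/(1 + ca)$, which is our own choice of parametrization rather than part of the original argument; it maps $a \in (0, \infty)$ onto $(0, 1/c)$. We assume $d \ge 2$ and $v > 0$ so that the infimum is attained, and the last line is the arithmetic-geometric mean inequality.

```latex
\begin{align*}
\theta = \frac{a}{1 + ca}
  \quad\Longrightarrow\quad
  1 - c\theta &= \frac{1}{1 + ca}, \\
\frac{1}{\theta}\Bigl[\log d + \frac{v\theta^2}{2(1 - c\theta)}\Bigr]
  &= \frac{(1 + ca)\log d}{a} + \frac{va}{2}
   = c\log d + \Bigl(\frac{\log d}{a} + \frac{va}{2}\Bigr), \\
\inf_{a > 0}\Bigl(\frac{\log d}{a} + \frac{va}{2}\Bigr)
  &= \sqrt{2v\log d}
  \quad\text{at } a = \sqrt{2v^{-1}\log d}.
\end{align*}
```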

Next, we turn to the results for the minimum eigenvalue. Combine the matrix Laplace transform
bound (3.2) with the trace mgf bound (4.4) to reach

\[ \mathbb{P}\{ \lambda_{\min}(X) \le -t \} \le d \cdot \inf_{\theta < 0} \exp\Bigl\{ \theta t + \frac{v\theta^2}{2} \Bigr\} = d \cdot e^{-t^2/2v}. \]

The infimum is attained at $\theta = -t/v$. To compute the expectation of the minimum eigenvalue, we
apply the Laplace transform bound (3.4) and the trace mgf bound (4.4), whence

\[ \mathbb{E} \lambda_{\min}(X) \ge \sup_{\theta < 0} \frac{1}{\theta} \Bigl[ \log d + \frac{v\theta^2}{2} \Bigr] = -\sqrt{2v \log d}. \]

The supremum is attained at $\theta = -\sqrt{2 v^{-1} \log d}$.

5. Refined Exponential Concentration for Bounded Random Matrices 

Although Theorem 4.1 is a strong result, the hypothesis $\Delta_X \preccurlyeq cX + vI$ on the conditional
variance is too stringent for many situations of interest. Our second major result shows that we
can use the typical behavior of the conditional variance to obtain tail bounds for the maximum
eigenvalue of a random Hermitian matrix.

Theorem 5.1 (Refined Concentration for Random Matrices). Let $(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$ be a matrix
Stein pair, and assume that X is almost surely bounded in norm. Define the function

\[ r(\psi) := \frac{1}{\psi} \log \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\psi \Delta_X} \quad\text{for each } \psi > 0, \tag{5.1} \]

where $\Delta_X$ is the conditional variance (2.4). Then, for all $t \ge 0$ and all $\psi > 0$,

\[ \mathbb{P}\{ \lambda_{\max}(X) \ge t \} \le d \cdot \exp\Bigl\{ \frac{-t^2}{2 r(\psi) + 2t/\sqrt{\psi}} \Bigr\}. \tag{5.2} \]

Furthermore, for all $\psi > 0$,

\[ \mathbb{E} \lambda_{\max}(X) \le \sqrt{2 r(\psi) \log d} + \frac{\log d}{\sqrt{\psi}}. \tag{5.3} \]

This theorem is essentially a matrix version of a result from Chatterjee's thesis [Cha08, Thm. 3.13].
The proof of Theorem 5.1 is similar in spirit to the proof of Theorem 4.1, so we postpone the
demonstration until Appendix A.

Let us offer some remarks to clarify the meaning of this result. Recall that $\Delta_X$ is a stochastic
approximation for the variance of the random matrix $X$. We can interpret the function $r(\psi)$ as a
measure of the typical magnitude of the conditional variance. Indeed, the matrix Laplace transform
result, Proposition 3.3, ensures that

\[
\mathbb{E} \lambda_{\max}(\Delta_X) \le \inf_{\psi > 0} \Big[ r(\psi) + \frac{\log d}{\psi} \Big].
\]

The moral of this inequality is that we can often identify a value of $\psi$ to make $r(\psi) \approx \mathbb{E} \lambda_{\max}(\Delta_X)$.
Ideally, we also want to choose $r(\psi) \gg \psi^{-1/2}$ so that the term $r(\psi)$ drives the tail bound (5.2)
when the parameter $t$ is small. In the next subsection, we show that these heuristics yield a matrix
Bernstein inequality.

5.1. Application: The Matrix Bernstein Inequality. As an illustration of Theorem 5.1, we
establish a tail bound for a sum of centered, independent random matrices that are subject to a
uniform norm bound.

Corollary 5.2 (Matrix Bernstein). Consider an independent sequence $(Y_k)_{k \ge 1}$ of random matrices
in $\mathbb{H}^d$ that satisfy

\[
\mathbb{E} Y_k = 0 \quad \text{and} \quad \|Y_k\| \le R \quad \text{for each index } k.
\]

Then, for all $t \ge 0$,

\[
\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k Y_k \Big) \ge t \Big\} \le d \cdot \exp\Big\{ \frac{-t^2}{3\sigma^2 + 2Rt} \Big\}
\quad \text{where} \quad \sigma^2 := \Big\| \sum\nolimits_k \mathbb{E} Y_k^2 \Big\|.
\]

Furthermore,

\[
\mathbb{E} \lambda_{\max}\Big( \sum\nolimits_k Y_k \Big) \le \sigma \sqrt{3 \log d} + R \log d.
\]

Corollary 5.2 is directly comparable with other matrix Bernstein inequalities in the literature.
The constants here are slightly worse than [Tro11b, Thm. 1.4] and slightly better than [Oli09,
Thm. 1.2]. The hypotheses in the current result are somewhat stricter than those in the prior
works. Nevertheless, the proof provides a template for studying more complicated random matrices
that involve dependent random variables.
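
To make the statement of Corollary 5.2 concrete, the following sketch evaluates both bounds for the special case $Y_k = \varepsilon_k A_k$ with Rademacher signs $\varepsilon_k$ and fixed Hermitian $A_k$, so that $\mathbb{E} Y_k^2 = A_k^2$. This choice of summands, the helper names, and the Monte Carlo comparison are our own illustrative assumptions; they are not part of the original argument.

```python
import numpy as np

def variance_and_radius(A):
    """A has shape (n, d, d). Returns sigma^2 = ||sum_k A_k^2|| and R = max_k ||A_k||."""
    sigma2 = np.linalg.norm(sum(Ak @ Ak for Ak in A), 2)
    R = max(np.linalg.norm(Ak, 2) for Ak in A)
    return sigma2, R

def tail_bound(t, sigma2, R, d):
    """Right-hand side of the tail bound: d * exp(-t^2 / (3 sigma^2 + 2 R t))."""
    return d * np.exp(-t**2 / (3.0 * sigma2 + 2.0 * R * t))

def mean_bound(sigma2, R, d):
    """Right-hand side of the expectation bound: sigma * sqrt(3 log d) + R log d."""
    return np.sqrt(3.0 * sigma2 * np.log(d)) + R * np.log(d)

def simulate_lambda_max(A, trials, rng):
    """Draw lambda_max(sum_k eps_k A_k) for independent Rademacher sign vectors."""
    eps = rng.choice([-1.0, 1.0], size=(trials, A.shape[0]))
    sums = np.einsum('tk,kij->tij', eps, A)      # one matrix sum per trial
    return np.linalg.eigvalsh(sums)[:, -1]       # eigenvalues come back in ascending order
```

Comparing `simulate_lambda_max(...).mean()` with `mean_bound(...)` shows how much slack the corollary leaves for a given family of coefficient matrices.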



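Proof. We consider the matrix Stein pair $(X, X')$ described in Section 2.4. The calculation (2.6)
shows that the conditional variance of $X$ satisfies

\[
\Delta_X = \frac{1}{2} \sum\nolimits_k \big( Y_k^2 + \mathbb{E} Y_k^2 \big).
\]

The function $r(\psi)$ measures the typical size of $\Delta_X$. To control $r(\psi)$, we center the conditional
variance and reduce the expression as follows.

\[
\begin{aligned}
r(\psi) := \frac{1}{\psi} \log \mathbb{E} \operatorname{\bar{tr}} e^{\psi \Delta_X}
  &\le \frac{1}{\psi} \log \mathbb{E} \operatorname{\bar{tr}} \exp\big\{ \psi(\Delta_X - \mathbb{E} \Delta_X) + \psi \|\mathbb{E} \Delta_X\| \cdot \mathbf{I} \big\} \\
  &= \frac{1}{\psi} \log \mathbb{E} \operatorname{\bar{tr}} \big[ e^{\psi \sigma^2} \cdot \exp\{ \psi(\Delta_X - \mathbb{E} \Delta_X) \} \big] \\
  &= \sigma^2 + \frac{1}{\psi} \log \mathbb{E} \operatorname{\bar{tr}} e^{\psi(\Delta_X - \mathbb{E} \Delta_X)}.
\end{aligned} \tag{5.4}
\]

The inequality depends on the monotonicity of the trace exponential [Pet94, Sec. 2]. Afterward, we
have applied the identity $\|\mathbb{E} \Delta_X\| = \|\mathbb{E} X^2\| = \sigma^2$, which follows from (2.5) and the independence
of the sequence $(Y_k)_{k \ge 1}$.

Introduce the centered random matrix

\[
W := \Delta_X - \mathbb{E} \Delta_X = \frac{1}{2} \sum\nolimits_k \big( Y_k^2 - \mathbb{E} Y_k^2 \big). \tag{5.5}
\]

Observe that $W$ consists of a sum of centered, independent random matrices, so we can study it
using the matrix Stein pair discussed in Section 2.4. Adapt the conditional variance calculation (2.6)
to obtain

\[
\begin{aligned}
\Delta_W &= \frac{1}{2} \cdot \frac{1}{4} \sum\nolimits_k \Big[ \big( Y_k^2 - \mathbb{E} Y_k^2 \big)^2 + \mathbb{E} \big( Y_k^2 - \mathbb{E} Y_k^2 \big)^2 \Big] \\
  &\preccurlyeq \frac{1}{8} \sum\nolimits_k \Big[ 2 Y_k^4 + 2 \big( \mathbb{E} Y_k^2 \big)^2 + \mathbb{E} Y_k^4 - \big( \mathbb{E} Y_k^2 \big)^2 \Big] \\
  &\preccurlyeq \frac{1}{4} \sum\nolimits_k \big( Y_k^4 + \mathbb{E} Y_k^4 \big).
\end{aligned}
\]

To reach the second line, we apply the operator convexity (1.1) of the matrix square to the first
parenthesis, and we compute the second expectation explicitly. The third line follows from the
operator Jensen inequality (1.2). To continue, make the estimate $Y_k^4 \preccurlyeq R^2 Y_k^2$ in both terms.
Thus,

\[
\Delta_W \preccurlyeq \frac{R^2}{4} \sum\nolimits_k \big( Y_k^2 + \mathbb{E} Y_k^2 \big) \preccurlyeq \frac{R^2}{2} \cdot W + \frac{R^2 \sigma^2}{2} \cdot \mathbf{I}.
\]

The trace mgf bound, Lemma 4.3, delivers

\[
\log m_W(\psi) = \log \mathbb{E} \operatorname{\bar{tr}} e^{\psi W} \le \frac{R^2 \sigma^2 \psi^2}{4 - 2R^2 \psi}. \tag{5.6}
\]

To complete the proof, combine the bounds (5.4) and (5.6) to reach

\[
r(\psi) \le \sigma^2 + \frac{R^2 \sigma^2 \psi}{4 - 2R^2 \psi}.
\]

In particular, it holds that $r(R^{-2}) \le 1.5\,\sigma^2$. The result now follows from Theorem 5.1. □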

6. Polynomial Moments and the Spectral Norm of a Random Matrix 

We can also study the spectral norm of a random matrix by bounding its polynomial moments. 
To progress toward these results, let us introduce the family of Schatten norms. 

Definition 6.1 (Schatten Norm). For each $p \ge 1$, the Schatten $p$-norm is defined as

\[
\|B\|_p := \big( \operatorname{tr} |B|^p \big)^{1/p} \quad \text{for } B \in \mathbb{M}^d.
\]

In this setting, $|B| := (B^* B)^{1/2}$. Bhatia's book [Bha97, Ch. IV] contains a detailed discussion of
these norms and their properties.
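
As a small computational aside, the eigenvalues of $|B|$ are the singular values of $B$, so the Schatten $p$-norm can be evaluated from a singular value decomposition. The sketch below relies on that standard fact; the function name `schatten_norm` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

def schatten_norm(B, p):
    """Schatten p-norm (tr |B|^p)^(1/p), computed from the singular values of B."""
    s = np.linalg.svd(B, compute_uv=False)   # singular values = eigenvalues of |B|
    return float(np.sum(s**p) ** (1.0 / p))
```

For $p = 2$ this agrees with the Frobenius norm, and for every $p \ge 1$ it is at least as large as the spectral norm, which is the fact used in the matrix Chebyshev method.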

The following proposition is a matrix analog of the Chebyshev bound from classical probability. 

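Proposition 6.2 (Matrix Chebyshev Method). Let $X$ be a random matrix. For all $t > 0$,

\[
\mathbb{P}\{\|X\| \ge t\} \le \inf_{p \ge 1} t^{-p} \cdot \mathbb{E} \|X\|_p^p. \tag{6.1}
\]

Furthermore,

\[
\mathbb{E} \|X\| \le \inf_{p \ge 1} \big( \mathbb{E} \|X\|_p^p \big)^{1/p}. \tag{6.2}
\]

Proof. To prove (6.1), we use Markov's inequality. For $p \ge 1$,

\[
\mathbb{P}\{\|X\| \ge t\} \le t^{-p} \cdot \mathbb{E} \|X\|^p = t^{-p} \cdot \mathbb{E} \big\| |X|^p \big\| \le t^{-p} \cdot \mathbb{E} \operatorname{tr} |X|^p,
\]

since the trace of a positive matrix dominates the maximum eigenvalue. To verify (6.2), select
$p \ge 1$. Jensen's inequality implies that

\[
\mathbb{E} \|X\| \le \big( \mathbb{E} \|X\|^p \big)^{1/p} = \big( \mathbb{E} \big\| |X|^p \big\| \big)^{1/p} \le \big( \mathbb{E} \operatorname{tr} |X|^p \big)^{1/p}.
\]

Identify the Schatten $p$-norm and take infima to complete the bounds. □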

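Remark 6.3 (Chebyshev vs. Laplace). The matrix Chebyshev bound (6.1) is at least as tight as the
analogous matrix Laplace transform bound (3.1) from Proposition 3.3. Indeed, suppose that $X$ is
a bounded random matrix. For all $\theta, t > 0$, we have

\[
\begin{aligned}
e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} e^{\theta |X|}
  &= e^{-\theta t} \cdot \sum_{q=0}^{\infty} \frac{\theta^q}{q!} \, \mathbb{E} \operatorname{tr} |X|^q \\
  &\ge e^{-\theta t} \cdot \sum_{q=0}^{\infty} \frac{(\theta t)^q}{q!} \Big[ 1 \wedge \inf_{p \ge 1} \big( t^{-p} \cdot \mathbb{E} \operatorname{tr} |X|^p \big) \Big]
   = 1 \wedge \inf_{p \ge 1} \big( t^{-p} \cdot \mathbb{E} \|X\|_p^p \big).
\end{aligned}
\]

The first identity follows from the Taylor expansion of the matrix exponential. A similar argument
allows us to convert polynomial moment bounds into bounds on the trace mgf.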

7. Polynomial Moment Inequalities for Random Matrices 

Our last major result demonstrates that the polynomial moments of a random Hermitian matrix
are controlled by the moments of the conditional variance. By combining this result with the matrix
Chebyshev method, Proposition 6.2, we can obtain probability inequalities for the spectral norm
of a random Hermitian matrix.

Theorem 7.1 (Matrix BDG Inequality). Let $p = 1$ or $p \ge 1.5$. Suppose that $(X, X')$ is a matrix
Stein pair where $\mathbb{E} \|X\|_{2p}^{2p} < \infty$. Then

\[
\big( \mathbb{E} \|X\|_{2p}^{2p} \big)^{1/2p} \le \sqrt{2p - 1} \cdot \big( \mathbb{E} \|\Delta_X\|_p^p \big)^{1/2p}.
\]

The conditional variance $\Delta_X$ is defined in (2.4).






This theorem extends a result of Chatterjee [Cha07, Thm. 1.5(iii)] to the matrix setting.
Chatterjee's bound can be viewed as an exchangeable pairs version of the Burkholder-Davis-Gundy
(BDG) inequality from classical martingale theory [Bur73]. Matrix extensions of the BDG
inequality appear in the work of Pisier-Xu [PX97] and the work of Junge-Xu [JX03, JX08].

The proof of Theorem 7.1 appears below in Section 7.3. Before we present the argument, let us
offer a few tangential remarks and describe a few striking applications of this inequality.

Remark 7.2 (Missing Values). The matrix BDG inequality also holds when $1 < p < 1.5$. In this
range, our best bound for the constant is $\sqrt{4p - 2}$. The proof requires a variant of the mean value
trace inequality for a convex function $h$.

Remark 7.3 (Infinite-Dimensional Versions). Theorem 7.1 and its corollaries extend to
Schatten-class operators.

7.1. Application: Matrix Khintchine Inequality. First, we demonstrate that the matrix BDG
inequality contains an improvement of the noncommutative Khintchine inequality [LP86, LPP91]
in the matrix setting. This result has been a dominant tool in several application areas over the
last few years, largely because of the articles [Rud99, RV07].

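Corollary 7.4 (Matrix Khintchine). Suppose that $p = 1$ or $p \ge 1.5$. Consider a finite sequence
$(Y_k)_{k \ge 1}$ of independent, random, Hermitian matrices and a deterministic sequence $(A_k)_{k \ge 1}$ for
which

\[
\mathbb{E} Y_k = 0 \quad \text{and} \quad Y_k^2 \preccurlyeq A_k^2 \quad \text{almost surely for each index } k. \tag{7.1}
\]

Then

\[
\Big( \mathbb{E} \Big\| \sum\nolimits_k Y_k \Big\|_{2p}^{2p} \Big)^{1/2p}
  \le \sqrt{p - 0.5} \cdot \Big\| \Big( \sum\nolimits_k \big( A_k^2 + \mathbb{E} Y_k^2 \big) \Big)^{1/2} \Big\|_{2p}.
\]

In particular, when $(\varepsilon_k)_{k \ge 1}$ is an independent sequence of Rademacher random variables,

\[
\Big( \mathbb{E} \Big\| \sum\nolimits_k \varepsilon_k A_k \Big\|_{2p}^{2p} \Big)^{1/2p}
  \le \sqrt{2p - 1} \cdot \Big\| \Big( \sum\nolimits_k A_k^2 \Big)^{1/2} \Big\|_{2p}. \tag{7.2}
\]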
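
For a small number of real symmetric coefficient matrices, the Rademacher case (7.2) can be examined exactly by enumerating every sign pattern. The sketch below is an illustration under that assumption; the function `khintchine_sides` is a hypothetical helper, and it compares both sides of (7.2) after raising them to the power $2p$, using $\|S^{1/2}\|_{2p}^{2p} = \operatorname{tr} S^p$ for a positive-semidefinite matrix $S$.

```python
import itertools
import numpy as np

def khintchine_sides(A, p):
    """Return (lhs, rhs) for (7.2) raised to the power 2p.

    A has shape (n, d, d) and holds real symmetric matrices A_k.
    lhs = E tr |sum_k eps_k A_k|^(2p), averaged exactly over all 2^n sign patterns.
    rhs = (2p - 1)^p * tr (sum_k A_k^2)^p.
    """
    n = A.shape[0]
    lhs = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        X = np.tensordot(np.array(signs), A, axes=1)          # sum_k eps_k A_k
        lhs += np.sum(np.abs(np.linalg.eigvalsh(X)) ** (2 * p))
    lhs /= 2**n
    S = sum(Ak @ Ak for Ak in A)                               # sum_k A_k^2, psd
    eig_S = np.clip(np.linalg.eigvalsh(S), 0.0, None)          # guard tiny negative round-off
    rhs = (2 * p - 1) ** p * np.sum(eig_S ** p)
    return lhs, rhs
```

When $p = 1$ the two sides coincide, because the cross terms $\mathbb{E}[\varepsilon_j \varepsilon_k]$ vanish for $j \ne k$; for larger $p$ the gap between them indicates how much room the constant $\sqrt{2p-1}$ leaves on a particular example.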



Proof. Consider the random matrix $X = \sum_k Y_k$. We use the matrix Stein pair constructed in
Section 2.4. According to (2.6), the conditional variance $\Delta_X$ satisfies

\[
\Delta_X = \frac{1}{2} \sum\nolimits_k \big( Y_k^2 + \mathbb{E} Y_k^2 \big) \preccurlyeq \frac{1}{2} \sum\nolimits_k \big( A_k^2 + \mathbb{E} Y_k^2 \big).
\]

An application of Theorem 7.1 completes the argument. □

Buchholz [Buc01, Thm. 5] has demonstrated that the optimal constant $C_{2p}$ in (7.2) satisfies

\[
C_{2p}^{2p} = (2p - 1)!! = \frac{(2p)!}{2^p \, p!} \quad \text{for } p = 1, 2, 3, \dots.
\]

It can be verified using basic analysis that

\[
\frac{(2p - 1)^p}{(2p - 1)!!} < e^{p - 1/2}.
\]

As a consequence, the constant in (7.2) lies within a factor $\sqrt{e}$ of optimal. Previous methods
for establishing the matrix Khintchine inequality are rather involved, so it is remarkable that the
simple argument based on exchangeable pairs leads to a result that is so accurate.
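
The comparison with the optimal constant is easy to tabulate. The sketch below assumes Buchholz's formula $C_{2p}^{2p} = (2p-1)!!$ for integer $p$ and computes the ratio $\sqrt{2p-1}/C_{2p}$ in logarithmic form to avoid overflow; the function names are hypothetical.

```python
import math

def double_factorial_odd(p):
    """(2p-1)!! = (2p)! / (2^p p!) for a positive integer p (exact integer arithmetic)."""
    return math.factorial(2 * p) // (2**p * math.factorial(p))

def constant_ratio(p):
    """Ratio sqrt(2p-1) / C_{2p}, where C_{2p}^{2p} = (2p-1)!! is assumed."""
    # log (2p-1)!! = log (2p)! - p log 2 - log p!
    log_df = math.lgamma(2 * p + 1) - p * math.log(2.0) - math.lgamma(p + 1)
    return math.exp(0.5 * math.log(2 * p - 1) - log_df / (2 * p))
```

Evaluating `constant_ratio(p)` over a range of integers shows the ratio starting at one for $p = 1$ and staying below $\sqrt{e}$, in line with the claim above.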
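7.2. Application: Matrix Rosenthal Inequality. As a second example, we can develop a
more sophisticated set of moment inequalities that are roughly the polynomial equivalent of the
exponential moment bound underlying the matrix Bernstein inequality.

Corollary 7.5 (Matrix Rosenthal Inequality). Suppose that $p = 1$ or $p \ge 1.5$. Consider a finite
sequence $(P_k)_{k \ge 1}$ of independent, random psd matrices that satisfy $\mathbb{E} \|P_k\|_{2p}^{2p} < \infty$. Then

\[
\Big( \mathbb{E} \Big\| \sum\nolimits_k P_k \Big\|_{2p}^{2p} \Big)^{1/2p}
  \le \bigg[ \Big\| \sum\nolimits_k \mathbb{E} P_k \Big\|_{2p}^{1/2}
     + \sqrt{4p - 2} \cdot \Big( \sum\nolimits_k \mathbb{E} \|P_k\|_{2p}^{2p} \Big)^{1/4p} \bigg]^2. \tag{7.3}
\]

Now, consider a finite sequence $(Y_k)_{k \ge 1}$ of centered, independent, random Hermitian matrices, and
assume that $\mathbb{E} \|Y_k\|_{4p}^{4p} < \infty$. Then

\[
\Big( \mathbb{E} \Big\| \sum\nolimits_k Y_k \Big\|_{4p}^{4p} \Big)^{1/4p}
  \le \sqrt{4p - 1} \cdot \Big\| \Big( \sum\nolimits_k \mathbb{E} Y_k^2 \Big)^{1/2} \Big\|_{4p}
     + (4p - 1) \cdot \Big( \sum\nolimits_k \mathbb{E} \|Y_k\|_{4p}^{4p} \Big)^{1/4p}. \tag{7.4}
\]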

Turn to Appendix B for our proof of Corollary 7.5. This result extends a moment inequality
due to Nagaev and Pinelis [NP77], which refines the constants in Rosenthal's inequality [Ros70,
Lem. 1]. See the historical discussion [Pin94, Sec. 5] for details. As we were finishing this paper,
we learned that Junge and Zheng have recently established a noncommutative moment
inequality [JZ11, Thm. 0.4] that is quite similar to Corollary 7.5.

7.3. Proof of the Matrix BDG Inequality. In many respects, the proof of the matrix BDG
inequality is similar to the proof of the exponential concentration result, Theorem 4.1. Both are
based on moment comparison arguments that ultimately depend on the method of exchangeable
pairs and the mean value trace inequality.

Suppose that $(X, X')$ is a matrix Stein pair with scale factor $\alpha$. First, observe that the result
for $p = 1$ already follows from (2.5). Therefore, we may assume that $p \ge 1.5$. Let us introduce
notation for the quantity of interest:

\[
E := \mathbb{E} \|X\|_{2p}^{2p} = \mathbb{E} \operatorname{tr} |X|^{2p}.
\]

Rewrite the expression for $E$ by peeling off a copy of $|X|$. This move yields

\[
E = \mathbb{E} \operatorname{tr} \big[ |X| \cdot |X|^{2p-1} \big] = \mathbb{E} \operatorname{tr} \big[ X \cdot \operatorname{sgn}(X) \cdot |X|^{2p-1} \big].
\]

Apply the method of exchangeable pairs, Lemma 2.3, with $F(X) = \operatorname{sgn}(X) \cdot |X|^{2p-1}$ to reach

\[
E = \frac{1}{2\alpha} \, \mathbb{E} \operatorname{tr} \Big[ (X - X') \cdot \big( \operatorname{sgn}(X) \cdot |X|^{2p-1} - \operatorname{sgn}(X') \cdot |X'|^{2p-1} \big) \Big].
\]

To verify the regularity condition (2.2) we need for Lemma 2.3, compute that

\[
\begin{aligned}
\mathbb{E} \big\| (X - X') \cdot \operatorname{sgn}(X) \cdot |X|^{2p-1} \big\|
  &\le \mathbb{E} \big( \|X\| \, \|X\|^{2p-1} \big) + \mathbb{E} \big( \|X'\| \, \|X\|^{2p-1} \big) \\
  &\le 2 \big( \mathbb{E} \|X\|_{2p}^{2p} \big)^{1/2p} \big( \mathbb{E} \|X\|_{2p}^{2p} \big)^{(2p-1)/2p} = 2 \, \mathbb{E} \|X\|_{2p}^{2p} < \infty.
\end{aligned}
\]

We have used the fact that $\operatorname{sgn}(X)$ is a unitary matrix, the exchangeability of $(X, X')$, Hölder's
inequality for expectation, and the fact that the Schatten $2p$-norm dominates the spectral norm.

We intend to apply the mean value trace inequality to obtain an estimate for the quantity $E$.
Consider the function $h : s \mapsto \operatorname{sgn}(s) \cdot |s|^{2p-1}$. Its derivative, $h'(s) = (2p - 1) \cdot |s|^{2p-2}$, is convex
because $p \ge 1.5$. Lemma 3.4 delivers the bound

\[
\begin{aligned}
E &\le \frac{2p - 1}{4\alpha} \, \mathbb{E} \operatorname{tr} \Big[ (X - X')^2 \cdot \big( |X|^{2p-2} + |X'|^{2p-2} \big) \Big] \\
  &= \frac{2p - 1}{2\alpha} \, \mathbb{E} \operatorname{tr} \Big[ (X - X')^2 \cdot |X|^{2p-2} \Big] \\
  &= (2p - 1) \cdot \mathbb{E} \operatorname{tr} \Big[ \Delta_X \cdot |X|^{2p-2} \Big].
\end{aligned}
\]

The second line follows from the exchangeability of $X$ and $X'$. In the last line, we identify the
conditional variance $\Delta_X$, defined in (2.4). As before, the moment bound $\mathbb{E} \|X\|_{2p}^{2p} < \infty$ is strong
enough to justify using the pull-through property in this step.

To continue, we must find a copy of $E$ within the latter expression. We can accomplish this goal
using one of the basic results from the theory of Schatten norms [Bha97, Cor. IV.2.6].

Proposition 7.6 (Hölder Inequality for Trace). Let $p$ and $q$ be Hölder conjugate indices, i.e.,
positive numbers with the relationship $q = p/(p - 1)$. Then

\[
\operatorname{tr}(BC) \le \|B\|_p \, \|C\|_q \quad \text{for all } B, C \in \mathbb{M}^d.
\]

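To complete the argument, apply the Hölder inequality for the trace followed by the Hölder
inequality for the expectation. Thus,

\[
\begin{aligned}
E &\le (2p - 1) \cdot \mathbb{E} \Big[ \|\Delta_X\|_p \cdot \big\| |X|^{2p-2} \big\|_{p/(p-1)} \Big] \\
  &= (2p - 1) \cdot \mathbb{E} \Big[ \|\Delta_X\|_p \cdot \|X\|_{2p}^{2p-2} \Big] \\
  &\le (2p - 1) \cdot \big( \mathbb{E} \|\Delta_X\|_p^p \big)^{1/p} \cdot \big( \mathbb{E} \|X\|_{2p}^{2p} \big)^{(p-1)/p} \\
  &= (2p - 1) \cdot \big( \mathbb{E} \|\Delta_X\|_p^p \big)^{1/p} \cdot E^{(p-1)/p}.
\end{aligned}
\]

Solve this algebraic inequality for the positive number $E$ to conclude that

\[
E \le (2p - 1)^p \cdot \mathbb{E} \|\Delta_X\|_p^p.
\]

Extract the $(2p)$th root to establish the matrix BDG inequality.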

8. Extension to General Complex Matrices 

Although it may seem that our theory is limited to random Hermitian matrices, results for
general random matrices follow as a formal corollary [Rec11, Tro11b]. The approach is based on a
device from operator theory [Pau02].

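Definition 8.1 (Hermitian Dilation). Let $B$ be a matrix in $\mathbb{C}^{d_1 \times d_2}$, and set $d = d_1 + d_2$. The
Hermitian dilation of $B$ is the matrix

\[
\mathscr{D}(B) := \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix} \in \mathbb{H}^d. \tag{8.1}
\]

The dilation has two valuable properties. First, it preserves spectral information:

\[
\lambda_{\max}(\mathscr{D}(B)) = \|\mathscr{D}(B)\| = \|B\|. \tag{8.2}
\]

Second, the square of the dilation satisfies

\[
\mathscr{D}(B)^2 = \begin{bmatrix} BB^* & 0 \\ 0 & B^* B \end{bmatrix}. \tag{8.3}
\]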
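
Both properties of the dilation are easy to confirm numerically. The following sketch builds the block matrix with NumPy; the function name `hermitian_dilation` is a hypothetical helper and is not notation from the text.

```python
import numpy as np

def hermitian_dilation(B):
    """Return the (d1 + d2) x (d1 + d2) block matrix [[0, B], [B^*, 0]]."""
    d1, d2 = B.shape
    zero_11 = np.zeros((d1, d1), dtype=B.dtype)
    zero_22 = np.zeros((d2, d2), dtype=B.dtype)
    return np.block([[zero_11, B], [B.conj().T, zero_22]])
```

For a rectangular complex matrix `B`, the largest eigenvalue of `hermitian_dilation(B)` should match the spectral norm of `B`, and the diagonal blocks of the squared dilation should reproduce $BB^*$ and $B^*B$.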



We can study a random matrix — not necessarily Hermitian — by applying our matrix concentra- 
tion inequalities to the Hermitian dilation of the random matrix. As an illustration, let us prove a 
Bernstein inequality for general random matrices. 

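Corollary 8.2 (Bernstein Inequality for General Matrices). Consider a finite sequence $(Z_k)_{k \ge 1}$ of
independent random matrices in $\mathbb{C}^{d_1 \times d_2}$ that satisfy

\[
\mathbb{E} Z_k = 0 \quad \text{and} \quad \|Z_k\| \le R \quad \text{for each index } k.
\]

Define $d := d_1 + d_2$, and introduce the variance measure

\[
\sigma^2 := \max\Big\{ \Big\| \sum\nolimits_k \mathbb{E} (Z_k Z_k^*) \Big\|, \ \Big\| \sum\nolimits_k \mathbb{E} (Z_k^* Z_k) \Big\| \Big\}.
\]

Then, for all $t \ge 0$,

\[
\mathbb{P}\Big\{ \Big\| \sum\nolimits_k Z_k \Big\| \ge t \Big\} \le d \cdot \exp\Big\{ \frac{-t^2}{3\sigma^2 + 2Rt} \Big\}.
\]

Furthermore,

\[
\mathbb{E} \Big\| \sum\nolimits_k Z_k \Big\| \le \sigma \sqrt{3 \log d} + R \log d.
\]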

Proof. Consider the random series $\sum_k \mathscr{D}(Z_k)$. The summands are independent, random Hermitian
matrices that satisfy

\[
\mathbb{E} \mathscr{D}(Z_k) = 0 \quad \text{and} \quad \|\mathscr{D}(Z_k)\| \le R.
\]

The second relation depends on the spectral property (8.2). Therefore, the matrix Bernstein
inequality, Corollary 5.2, applies. To state the outcome, we first note that $\lambda_{\max}(\sum_k \mathscr{D}(Z_k)) =
\|\sum_k Z_k\|$, again because of the spectral property (8.2). Next, use the formula (8.3) to compute
that

\[
\Big\| \sum\nolimits_k \mathbb{E} \big[ \mathscr{D}(Z_k)^2 \big] \Big\|
  = \left\| \begin{bmatrix} \sum_k \mathbb{E} (Z_k Z_k^*) & 0 \\ 0 & \sum_k \mathbb{E} (Z_k^* Z_k) \end{bmatrix} \right\|
  = \sigma^2.
\]

This observation completes the proof. □



9. The Sum of Conditionally Zero-Mean Matrices 

A chief advantage of the method of exchangeable pairs is its ability to handle random matrices 
constructed from dependent random variables. In this section, we briefly describe a way to relax 
the independence requirement when studying a sum of random matrices. In Sections 10 and 11,
we develop more elaborate examples.

9.1. Formulation. Consider a finite sequence $(Y_1, \dots, Y_n)$ of random Hermitian matrices. We say
that the sequence has the conditional zero-mean property when

\[
\mathbb{E} \big[ Y_k \mid (Y_j)_{j \ne k} \big] = 0 \quad \text{almost surely for each index } k. \tag{9.1}
\]

This definition is related to the conditional expectation property of a martingale difference sequence,
although it is more restrictive. Suppose that we are interested in a sum of conditionally zero-mean
random matrices:

\[
X := Y_1 + \dots + Y_n. \tag{9.2}
\]

This type of series is quite common because it includes the case of a Rademacher series with random
matrix coefficients.

Example 9.1 (Rademacher Series with Random Matrix Coefficients). Consider a finite sequence
$(W_k)_{k \ge 1}$ of random Hermitian matrices. Suppose that the sequence $(\varepsilon_k)_{k \ge 1}$ consists of independent
Rademacher random variables that are independent from the random matrices. Consider the
random series

\[
\sum\nolimits_k \varepsilon_k W_k.
\]

The summands may be strongly dependent on each other, but the independence of the Rademacher
variables ensures that the summands satisfy the conditional zero-mean property (9.1).
9.2. A Matrix Stein Pair. Let us describe how to build a matrix Stein pair $(X, X')$ for the
sum (9.2) of conditionally zero-mean random matrices. The approach is similar to the case of
an independent sum, which appears in Section 2.4. For each $k$, we draw a random matrix $Y_k'$ so
that $Y_k'$ and $Y_k$ are conditionally i.i.d. given $(Y_j)_{j \ne k}$. Then, independently, we draw an index $K$
uniformly at random from $\{1, \dots, n\}$. As in Section 2.4, the random matrix

\[
X' := Y_1 + \dots + Y_{K-1} + Y_K' + Y_{K+1} + \dots + Y_n
\]

is an exchangeable counterpart to $X$. The conditional zero-mean property (9.1) ensures that

\[
\begin{aligned}
\mathbb{E} \big[ X - X' \mid (Y_j)_{j \ge 1} \big]
  &= \mathbb{E} \big[ Y_K - Y_K' \mid (Y_j)_{j \ge 1} \big] \\
  &= \frac{1}{n} \sum_{k=1}^n \Big( Y_k - \mathbb{E} \big[ Y_k' \mid (Y_j)_{j \ne k} \big] \Big)
   = \frac{1}{n} \sum_{k=1}^n Y_k = \frac{1}{n} X.
\end{aligned}
\]

Therefore, $(X, X')$ is a matrix Stein pair with scale factor $\alpha = n^{-1}$.

We can determine the conditional variance after a short argument that parallels the computation
(2.6) in the independent setting:

\[
\Delta_X = \frac{n}{2} \cdot \mathbb{E} \big[ (Y_K - Y_K')^2 \mid (Y_j)_{j \ge 1} \big]
  = \frac{1}{2} \sum_{k=1}^n \Big( Y_k^2 + \mathbb{E} \big[ Y_k^2 \mid (Y_j)_{j \ne k} \big] \Big). \tag{9.3}
\]

The expression (9.3) shows that, even in the presence of some dependence, we can control the size
of the conditional expectation uniformly if we control the size of the individual summands.

Using the Stein pair $(X, X')$ and the expression (9.3), we may develop a variety of concentration
inequalities for conditionally zero-mean sums that are analogous to our results for independent
sums. We omit detailed examples.



10. Combinatorial Sums of Matrices 

The method of exchangeable pairs can also be applied to many types of highly symmetric
distributions. In this section, we study a class of combinatorial matrix statistics, which generalize the
scalar statistics studied by Hoeffding [Hoe51].

10.1. Formulation. Consider a deterministic array $(A_{jk})_{j,k=1}^n$ of Hermitian matrices, and let $\pi$ be
a uniformly random permutation on $\{1, \dots, n\}$. Define the random matrix

\[
Y := \sum_{j=1}^n A_{j\pi(j)} \quad \text{whose mean} \quad \mathbb{E} Y = \frac{1}{n} \sum_{j,k=1}^n A_{jk}. \tag{10.1}
\]

The combinatorial sum $Y$ is a natural candidate for an exchangeable pair analysis because the
random permutation is highly symmetric. Before we describe how to construct a matrix Stein pair,
let us mention a few problems that lead to a random matrix of the form $Y$.

Example 10.1 (Sampling without Replacement). Consider a finite collection $\mathscr{B} := \{B_1, \dots, B_n\}$
of deterministic Hermitian matrices. Suppose that we want to study a sum of $s$ matrices sampled
randomly from $\mathscr{B}$ without replacement. We can express this type of series in the form

\[
W := \sum_{j=1}^s B_{\pi(j)},
\]

where $\pi$ is a random permutation on $\{1, \dots, n\}$. The matrix $W$ is therefore an example of a
combinatorial sum.

Example 10.2 (A Randomized "Inner Product"). Consider two fixed sequences of complex
matrices

\[
B_1, \dots, B_n \in \mathbb{C}^{d_1 \times s} \quad \text{and} \quad C_1, \dots, C_n \in \mathbb{C}^{s \times d_2}.
\]

We may form a permuted matrix "inner product" by arranging one sequence in random order,
multiplying the elements of the two sequences together, and summing the terms. In other words,
we are interested in the random matrix

\[
Z := \sum_{j=1}^n B_j C_{\pi(j)}.
\]

Then the random matrix $\mathscr{D}(Z)$ is a combinatorial sum of Hermitian matrices.

10.2. A Matrix Stein Pair. To study the combinatorial sum (10.1) of matrices using the method
of exchangeable pairs, we first introduce the zero-mean random matrix

\[
X := Y - \mathbb{E} Y.
\]

To construct a matrix Stein pair $(X, X')$, we draw a pair $(J, K)$ of indices independently of $\pi$ and
uniformly at random from $\{1, \dots, n\}^2$. Define a second random permutation $\pi' := \pi \circ (J, K)$ by
composing $\pi$ with the transposition of the random indices $J$ and $K$. The pair $(\pi, \pi')$ is exchangeable,
so the matrix

\[
X' := \sum_{j=1}^n A_{j\pi'(j)} - \mathbb{E} Y
\]

is an exchangeable counterpart to $X$.

To verify that $(X, X')$ is a matrix Stein pair, we calculate that

\[
\begin{aligned}
\mathbb{E} [X - X' \mid \pi]
  &= \mathbb{E} \big[ A_{J\pi(J)} + A_{K\pi(K)} - A_{J\pi(K)} - A_{K\pi(J)} \mid \pi \big] \\
  &= \frac{1}{n^2} \sum_{j,k=1}^n \big[ A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)} \big]
   = \frac{2}{n} (Y - \mathbb{E} Y) = \frac{2}{n} X.
\end{aligned}
\]

The first identity holds because the random sums $X$ and $X'$ differ for only four choices of indices.
We see that $(X, X')$ is a Stein pair with scale factor $\alpha = 2/n$.

Turning to the conditional variance, we find that

\[
\Delta_X(\pi) = \frac{n}{4} \, \mathbb{E} \big[ (X - X')^2 \mid \pi \big]
  = \frac{1}{4n} \sum_{j,k=1}^n \big[ A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)} \big]^2. \tag{10.2}
\]

The structure of the conditional variance is somewhat different from previous examples, but we
recognize that $\Delta_X$ is controlled when the matrices $A_{jk}$ are bounded.
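
The Stein pair identity for the combinatorial sum can be confirmed by brute force for a small array, since the conditional expectation given $\pi$ is simply an average over the $n^2$ index pairs $(J, K)$. The sketch below assumes an array of real symmetric matrices stored with shape `(n, n, d, d)`; the function names are hypothetical.

```python
import numpy as np

def combinatorial_sum(A, perm):
    """Y = sum_j A[j, perm[j]] for an array A of shape (n, n, d, d)."""
    return sum(A[j, perm[j]] for j in range(A.shape[0]))

def mean_difference(A, perm):
    """E[X - X' | pi], computed by averaging over all n^2 index pairs (J, K)."""
    n = A.shape[0]
    Y = combinatorial_sum(A, perm)
    total = np.zeros_like(Y)
    for J in range(n):
        for K in range(n):
            swapped = perm.copy()
            swapped[J], swapped[K] = perm[K], perm[J]   # pi' = pi composed with (J, K)
            total += Y - combinatorial_sum(A, swapped)  # X - X' equals Y - Y'
    return total / n**2
```

The output of `mean_difference(A, perm)` should coincide with `(2 / n) * (Y - EY)`, where `EY = A.sum(axis=(0, 1)) / n`, which is the scale factor $\alpha = 2/n$ found above.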

10.3. Exponential Concentration for a Combinatorial Sum. We can apply our matrix
concentration results to study the behavior of a combinatorial sum of matrices. As an example, let us
present a Bernstein-type inequality. The argument is similar to the proof of Corollary 5.2, so we
leave the details to Appendix C.

Corollary 10.3 (Bernstein Inequality for a Combinatorial Sum of Matrices). Consider an array
$(A_{jk})_{j,k=1}^n$ of deterministic matrices in $\mathbb{H}^d$ that satisfy

\[
\sum_{j,k=1}^n A_{jk} = 0 \quad \text{and} \quad \|A_{jk}\| \le R \quad \text{for each pair } (j, k) \text{ of indices}.
\]

Define the random matrix $X := \sum_{j=1}^n A_{j\pi(j)}$, where $\pi$ is a uniformly random permutation on
$\{1, \dots, n\}$. Then, for all $t \ge 0$,

\[
\mathbb{P}\{\lambda_{\max}(X) \ge t\} \le d \cdot \exp\Big\{ \frac{-t^2}{12\sigma^2 + 4\sqrt{2}\,Rt} \Big\}
\quad \text{where} \quad \sigma^2 := \frac{1}{n} \Big\| \sum_{j,k=1}^n A_{jk}^2 \Big\|.
\]

Furthermore,

\[
\mathbb{E} \lambda_{\max}(X) \le \sigma \sqrt{12 \log d} + 2\sqrt{2}\,R \log d.
\]

11. Self-Reproducing Matrix Functions



The method of exchangeable pairs can also be used to analyze nonlinear matrix-valued functions 
of random variables. In this section, we explain how to analyze matrix functions that satisfy a 
self-reproducing property. 
11.1. Example: Matrix Second-Order Rademacher Chaos. We begin with an example that
shows how the self-reproducing property might arise. Consider a quadratic form that takes on
random matrix values:

\[
H(\varepsilon) := \sum\nolimits_k \sum\nolimits_{j < k} \varepsilon_j \varepsilon_k A_{jk}. \tag{11.1}
\]

In this expression, $\varepsilon$ is a finite vector of independent Rademacher random variables. The array
$(A_{jk})_{j,k \ge 1}$ consists of deterministic Hermitian matrices, and we assume that $A_{jk} = A_{kj}$.

Observe that the summands in $H(\varepsilon)$ are dependent and that they need not satisfy the conditional
zero-mean property (9.1). Nevertheless, $H(\varepsilon)$ does satisfy a fruitful self-reproducing property:

\[
\sum\nolimits_k \Big( H(\varepsilon) - \mathbb{E} \big[ H(\varepsilon) \mid (\varepsilon_j)_{j \ne k} \big] \Big)
  = \sum\nolimits_k \sum\nolimits_{j \ne k} \varepsilon_j \big( \varepsilon_k - \mathbb{E} [\varepsilon_k] \big) A_{jk}
  = 2 H(\varepsilon).
\]

We have applied the pull-through property of conditional expectation, the assumption that the
Rademacher variables are independent, and the fact that $A_{jk} = A_{kj}$. As we will see, this type of
self-reproducing property can be used to construct a matrix Stein pair.
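
Because each Rademacher variable takes only two values, the conditional expectations in the self-reproducing property can be evaluated exactly. The sketch below does this for a small chaos and is meant only as an illustration: the helper names are hypothetical, and the coefficient array is assumed to have shape `(n, n, d, d)` with `A[j, k] == A[k, j]`.

```python
import itertools
import numpy as np

def chaos(eps, A):
    """H(eps) = sum over pairs j < k of eps_j * eps_k * A[j, k]."""
    n = len(eps)
    return sum(eps[j] * eps[k] * A[j, k] for k in range(n) for j in range(k))

def self_reproducing_sum(eps, A):
    """sum_k ( H(eps) - E[H(eps) | (eps_j)_{j != k}] ), averaging eps_k over {+1, -1}."""
    n = len(eps)
    H = chaos(eps, A)
    total = np.zeros_like(H)
    for k in range(n):
        cond_mean = np.zeros_like(H)
        for sign in (1.0, -1.0):
            e = np.array(eps, dtype=float)
            e[k] = sign                      # resample coordinate k only
            cond_mean += 0.5 * chaos(e, A)
        total += H - cond_mean
    return total
```

For every sign vector `eps`, the value of `self_reproducing_sum(eps, A)` should equal `2 * chaos(eps, A)`, which corresponds to the parameter $s = 2$ in the general formulation that follows.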

A random matrix of the form (11.1) is called a second-order Rademacher chaos. This class of
random matrices arises in a variety of situations, including randomized linear algebra [CD11a],
compressed sensing [Rau10, Sec. 9], and chance-constrained optimization [CSW11]. Indeed,
concentration inequalities for the matrix-valued Rademacher chaos have a wide range of potential
applications.

11.2. Formulation and Matrix Stein Pair. In this section, we describe a more general version
of the self-reproducing property. Suppose that $z := (Z_1, \dots, Z_n)$ is a random vector taking values
in a Polish space $\mathcal{Z}$. First, we construct an exchangeable counterpart

\[
z' := (Z_1, \dots, Z_{K-1}, Z_K', Z_{K+1}, \dots, Z_n) \tag{11.2}
\]

where $Z_k$ and $Z_k'$ are conditionally i.i.d. given $(Z_j)_{j \ne k}$ and $K$ is an independent coordinate drawn
uniformly at random from $\{1, \dots, n\}$.

Next, let $H : \mathcal{Z}^n \to \mathbb{H}^d$ be a bounded measurable function. Assume that $H(z)$ satisfies an
abstract self-reproducing property: for a parameter $s > 0$,

\[
\sum_{k=1}^n \Big( H(z) - \mathbb{E} \big[ H(Z_1, \dots, Z_k', \dots, Z_n) \mid z \big] \Big) = s \cdot \big( H(z) - \mathbb{E} H(z) \big) \quad \text{almost surely}. \tag{11.3}
\]

Under this assumption, we can easily check that the random matrices

\[
X := H(z) - \mathbb{E} H(z) \quad \text{and} \quad X' := H(z') - \mathbb{E} H(z)
\]

form a matrix Stein pair. Indeed,

\[
\mathbb{E} [X - X' \mid z] = \mathbb{E} [H(z) - H(z') \mid z] = \frac{s}{n} \big( H(z) - \mathbb{E} H(z) \big) = \frac{s}{n} X.
\]

We determine that $(X, X')$ is a matrix Stein pair with scaling factor $\alpha = s/n$.

Finally, we compute the conditional variance:

\[
\Delta_X(z) = \frac{n}{2s} \cdot \mathbb{E} \big[ (H(z) - H(z'))^2 \mid z \big]
  = \frac{1}{2s} \sum_{k=1}^n \mathbb{E} \Big[ \big( H(z) - H(Z_1, \dots, Z_k', \dots, Z_n) \big)^2 \mid z \Big].
\]

We discover that the conditional variance is small when each coordinate of $H$ has controlled
variance. In this case, the method of exchangeable pairs provides good concentration inequalities for
the random matrix $X$.
11.3. Matrix Bounded Differences Inequality. As an example of this framework, we can
develop a bounded differences inequality for random matrices by appealing to Theorem 4.1.

Corollary 11.1 (Matrix Bounded Differences). Let $z := (Z_1, \dots, Z_n)$ be a random vector taking
values in a Polish space $\mathcal{Z}$, and let $z'$ be an exchangeable counterpart, constructed as in (11.2).
Suppose that $H : \mathcal{Z}^n \to \mathbb{H}^d$ is a bounded function that satisfies the self-reproducing property

\[
\sum_{k=1}^n \Big( H(z) - \mathbb{E} \big[ H(Z_1, \dots, Z_k', \dots, Z_n) \mid z \big] \Big) = s \cdot \big( H(z) - \mathbb{E} H(z) \big) \quad \text{almost surely}
\]

for a parameter $s > 0$ as well as the bounded differences condition

\[
\mathbb{E} \Big[ \big( H(z) - H(Z_1, \dots, Z_k', \dots, Z_n) \big)^2 \mid z \Big] \preccurlyeq A_k^2 \quad \text{almost surely for each index } k, \tag{11.4}
\]

where $A_k$ is a deterministic matrix in $\mathbb{H}^d$. Then, for all $t \ge 0$,

\[
\mathbb{P}\big\{ \lambda_{\max}\big( H(z) - \mathbb{E} H(z) \big) \ge t \big\} \le d \cdot e^{-st^2/L}
\quad \text{where} \quad L := \Big\| \sum_{k=1}^n A_k^2 \Big\|.
\]

Furthermore,

\[
\mathbb{E} \lambda_{\max}\big( H(z) - \mathbb{E} H(z) \big) \le \sqrt{\frac{L \log d}{s}}.
\]

In the scalar setting, Corollary 11.1 reduces to a version of McDiarmid's bounded difference
inequality [McD89]. The result also complements the matrix bounded difference inequality of [Tro11b,
Cor. 7.5], which requires independent input variables but makes no self-reproducing assumption.

Proof. Since $H(z)$ is self-reproducing, we may construct a matrix Stein pair $(X, X')$ with scale
factor $\alpha = s/n$ as in Section 11.2. According to (11.2), the conditional variance of the pair satisfies

\[
\Delta_X = \frac{1}{2s} \sum_{k=1}^n \mathbb{E} \Big[ \big( H(z) - H(Z_1, \dots, Z_k', \dots, Z_n) \big)^2 \mid z \Big]
  \preccurlyeq \frac{1}{2s} \sum_{k=1}^n A_k^2 \preccurlyeq \frac{L}{2s} \cdot \mathbf{I}.
\]

We have used the bounded differences condition (11.4) and the definition of the bound $L$. To
complete the proof, we apply the concentration result, Theorem 4.1, with the parameters $c = 0$
and $v = L/2s$. □

12. Extensions and Future Work 

Although the examples we treat in this work all involve matrix Stein pairs, we are aware of 
problems that require a larger class of exchangeable pairs. The following definition represents a 
significant extension of the matrix Stein pair formalism. 

Definition 12.1 (Generalized Matrix Stein Pair). Let $g : \mathbb{R} \to \mathbb{R}$ be a weakly increasing function.
Let $(Z, Z')$ be an exchangeable pair of random variables taking values in a Polish space $\mathcal{Z}$, and let
$\Psi : \mathcal{Z} \to \mathbb{H}^d$ be a measurable function. Define the random Hermitian matrices

\[
X := \Psi(Z) \quad \text{and} \quad X' := \Psi(Z').
\]

We say that $(X, X')$ is a generalized matrix Stein pair if there is a constant $\alpha \in (0, 1]$ for which

\[
\mathbb{E} [g(X) - g(X') \mid Z] = \alpha X \quad \text{almost surely}.
\]

When the function $g$ is linear, a generalized matrix Stein pair reduces to a matrix Stein pair. The
conditional variance of the generalized pair is defined as

\[
\Delta_X := \Delta_X(Z) := \frac{1}{2\alpha} \cdot \operatorname{Re} \mathbb{E} \big[ (g(X) - g(X')) \cdot (X - X') \mid Z \big],
\]

where $\operatorname{Re}(A) := \frac{1}{2}(A + A^*)$ refers to the Hermitian part of a square matrix.






We have chosen to focus on the simpler definition of a Stein pair because the presentation is 
more transparent. Nevertheless, the results in this paper all extend readily to generalized matrix 
Stein pairs with little extra work. We leave applications of generalized matrix Stein pairs for future 
research. 
Appendix A. Proof of Theorem 5.1: Refined Exponential Concentration

The proof of Theorem 5.1 parallels the argument in Theorem 4.1, but it differs at an important
point. In the earlier result, we used an almost-sure bound on the conditional variance to control
the derivative of the trace mgf. This time, we use entropy inequalities to introduce finer
information about the behavior of the conditional variance. The proof is essentially a matrix version of
Chatterjee's argument [Cha08, Thm. 3.13].

Let $(X, X')$ be a matrix Stein pair. Consider the normalized trace mgf

\[
m(\theta) := \mathbb{E} \operatorname{\bar{tr}} e^{\theta X} \quad \text{for } \theta \ge 0. \tag{A.1}
\]

Our main object is to establish the following inequality for the trace mgf.

\[
\log m(\theta) \le \frac{r(\psi) \, \theta^2}{2(1 - \theta^2/\psi)} \quad \text{for } 0 \le \theta < \sqrt{\psi} \text{ and } \psi > 0. \tag{A.2}
\]

The function $r(\psi)$ is defined in (5.1). We establish (A.2) in Section A.1 et seq. Afterward, in
Section A.5, we invoke the matrix Laplace transform bound to complete the proof of Theorem 5.1.
A.1. The Derivative of the Trace Mgf. The first steps of the argument are the same as in
the proof of Theorem 4.1. Since $X$ is almost surely bounded, we need not worry about regularity
conditions. The derivative of the trace mgf satisfies

\[
m'(\theta) = \mathbb{E} \operatorname{\bar{tr}} \big[ X e^{\theta X} \big] \quad \text{for } \theta \in \mathbb{R}. \tag{A.3}
\]

Lemma 3.7 provides a bound for the derivative in terms of the conditional variance:

\[
m'(\theta) \le \theta \cdot \mathbb{E} \operatorname{\bar{tr}} \big[ \Delta_X e^{\theta X} \big] \quad \text{for } \theta \ge 0. \tag{A.4}
\]

In the proof of Lemma 4.3, we applied an almost-sure bound for the conditional variance to control
the derivative of the mgf. This time, we incorporate information about the typical size of $\Delta_X$ by
developing a bound in terms of the function $r(\psi)$.

A.2. Entropy for Random Matrices and Duality. Let us introduce an entropy function for
random matrices.

Definition A.1 (Entropy for Random Matrices). Let $W$ be a random matrix in $\mathbb{H}^d_+$ subject to
the normalization $\mathbb{E} \operatorname{\bar{tr}} W = 1$. The (negative) matrix entropy is defined as

\[
\operatorname{ent}(W) := \mathbb{E} \operatorname{\bar{tr}} (W \log W). \tag{A.5}
\]

We enforce the convention that $0 \log 0 = 0$.

The matrix entropy is relevant to our discussion because its Fenchel-Legendre conjugate is the
cumulant generating function. The Young inequality for matrix entropy provides one way to
formulate this duality relationship.

Proposition A.2 (Young Inequality for Matrix Entropy). Suppose that $V$ is a random matrix in
$\mathbb{H}^d$ that is almost surely bounded in norm, and suppose that $W$ is a random matrix in $\mathbb{H}^d_+$ subject
to the normalization $\mathbb{E} \operatorname{\bar{tr}} W = 1$. Then

\[
\mathbb{E} \operatorname{\bar{tr}} (VW) \le \log \mathbb{E} \operatorname{\bar{tr}} e^{V} + \operatorname{ent}(W).
\]

Proposition A.2 follows from an easy variant of the argument in [Car10, Thm. 2.13].

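A.3. A Refined Differential Inequality for the Trace Mgf. We intend to apply the Young
inequality for matrix entropy to decouple the product of random matrices in (A.4). First, we must
rescale the exponential in (A.4) so its expected trace equals one:

\[
W(\theta) := \frac{e^{\theta X}}{\mathbb{E} \operatorname{\bar{tr}} e^{\theta X}} = \frac{e^{\theta X}}{m(\theta)}. \tag{A.6}
\]

For each $\psi > 0$, we can rewrite (A.4) as

\[
m'(\theta) \le \frac{\theta \, m(\theta)}{\psi} \cdot \mathbb{E} \operatorname{\bar{tr}} \big[ \psi \Delta_X \cdot W(\theta) \big].
\]

The Young inequality for matrix entropy, Proposition A.2, implies that

\[
m'(\theta) \le \frac{\theta \, m(\theta)}{\psi} \Big[ \log \mathbb{E} \operatorname{\bar{tr}} e^{\psi \Delta_X} + \operatorname{ent}(W(\theta)) \Big]. \tag{A.7}
\]

The first term in the bracket is precisely $\psi \, r(\psi)$. Let us examine the second term more closely.

To control the matrix entropy of $W(\theta)$, we need to bound its logarithm. Referring back to the
definition (A.6), we see that

\[
\log W(\theta) = \theta X - \big( \log \mathbb{E} \operatorname{\bar{tr}} e^{\theta X} \big) \cdot \mathbf{I}
  \preccurlyeq \theta X - \big( \log \operatorname{\bar{tr}} e^{\theta \, \mathbb{E} X} \big) \cdot \mathbf{I} = \theta X. \tag{A.8}
\]

The second relation depends on Jensen's inequality and the fact that the trace exponential is
convex [Pet94, Sec. 2]. The third relation relies on the property that $\mathbb{E} X = 0$. Since the matrix
$W(\theta)$ is positive, we can substitute the semidefinite bound (A.8) into the definition (A.5) of the
matrix entropy:

\[
\operatorname{ent}(W(\theta)) = \mathbb{E} \operatorname{\bar{tr}} \big[ W(\theta) \cdot \log W(\theta) \big]
  \le \theta \cdot \mathbb{E} \operatorname{\bar{tr}} \big[ W(\theta) \cdot X \big]
  = \frac{\theta}{m(\theta)} \cdot \mathbb{E} \operatorname{\bar{tr}} \big[ X e^{\theta X} \big].
\]

We have reintroduced the definition (A.6) of $W(\theta)$ in the last relation. Identify the derivative (A.3)
of the trace mgf to reach

\[
\operatorname{ent}(W(\theta)) \le \frac{\theta \, m'(\theta)}{m(\theta)}. \tag{A.9}
\]

To establish a differential inequality, substitute the definition (5.1) of $r(\psi)$ and the bound (A.9)
into the estimate (A.7) to discover that

\[
m'(\theta) \le \frac{\theta \, m(\theta)}{\psi} \Big[ \psi \, r(\psi) + \frac{\theta \, m'(\theta)}{m(\theta)} \Big]
  = r(\psi) \, \theta \cdot m(\theta) + \frac{\theta^2}{\psi} \cdot m'(\theta).
\]

Rearrange this formula to isolate the log-derivative $m'(\theta)/m(\theta)$ of the trace mgf. We conclude that

\[
\frac{d}{d\theta} \log m(\theta) \le \frac{r(\psi) \, \theta}{1 - \theta^2/\psi} \quad \text{for } 0 \le \theta < \sqrt{\psi}. \tag{A.10}
\]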

A.4. Solving the Differential Inequality. To integrate (A.10), recall that $\log m(0) = 0$, and invoke the fundamental theorem of calculus to see that

\[
\log m(\theta) = \int_0^{\theta} \frac{d}{ds} \log m(s)\, ds \le \int_0^{\theta} \frac{r(\psi)\, s}{1 - s^2/\psi}\, ds \le \int_0^{\theta} \frac{r(\psi)\, s}{1 - \theta^2/\psi}\, ds = \frac{r(\psi)\,\theta^2}{2(1 - \theta^2/\psi)}.
\]

This calculation is valid whenever $0 \le \theta < \sqrt{\psi}$. This is the claim (A.2).
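The integration step can be checked numerically. The integral of $r(\psi)s/(1 - s^2/\psi)$ over $[0, \theta]$ has the closed form $-\tfrac{1}{2} r(\psi)\psi \log(1 - \theta^2/\psi)$, and the relaxation above replaces it by a slightly larger but simpler expression. The following sketch compares the two under arbitrary illustrative values of $r$ and $\psi$; all function names are hypothetical.

```python
import math

def exact_integral(r, psi, theta):
    """Closed form of the integral of r*s / (1 - s**2/psi) over [0, theta]."""
    return -0.5 * r * psi * math.log(1.0 - theta ** 2 / psi)

def midpoint_integral(r, psi, theta, steps=20000):
    """Midpoint-rule approximation of the same integral."""
    h = theta / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        total += r * s / (1.0 - s ** 2 / psi)
    return total * h

def relaxed_bound(r, psi, theta):
    """The simpler upper bound r*theta**2 / (2*(1 - theta**2/psi))."""
    return r * theta ** 2 / (2.0 * (1.0 - theta ** 2 / psi))

r, psi = 1.5, 2.0  # arbitrary illustrative values, not taken from the paper
checks = []
for frac in (0.1, 0.3, 0.5, 0.7, 0.9):
    theta = frac * math.sqrt(psi)
    exact = exact_integral(r, psi, theta)
    approx = midpoint_integral(r, psi, theta)
    bound = relaxed_bound(r, psi, theta)
    # (quadrature agrees with the closed form, closed form is below the bound)
    checks.append((abs(exact - approx) < 1e-5, exact <= bound))
```

The relaxation loses little for small $\theta$ and keeps the resulting mgf bound in a form that the Laplace transform argument can optimize explicitly.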

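A.5. The Matrix Laplace Transform Argument. With the trace mgf bound (A.2) at hand, we can complete the proof of Theorem 5.1. Proposition 3.3, the matrix Laplace transform method, yields the estimate

\[
\mathbb{P}\{\lambda_{\max}(X) \ge t\} \le d \cdot \inf_{0 < \theta < \sqrt{\psi}} \exp\left\{-\theta t + \frac{r(\psi)\,\theta^2}{2(1 - \theta^2/\psi)}\right\} \le d \cdot e^{-t\theta_\star/2}, \tag{A.11}
\]

where we define $\theta_\star$ implicitly as the positive root of the quadratic equation

\[
\frac{r(\psi)\,\theta_\star}{1 - \theta_\star^2/\psi} = t.
\]

Solve the quadratic equation to obtain the explicit formula

\[
\theta_\star = \frac{r(\psi)\,\psi}{2t} \left[\sqrt{1 + \frac{4t^2}{r(\psi)^2\,\psi}} - 1\right]. \tag{A.12}
\]

The numerical fact $\sqrt{1 + a} \le 1 + \sqrt{a}$, valid for $a \ge 0$, allows us to verify that $\theta_\star \le \sqrt{\psi}$. We can obtain a lower bound

\[
\theta_\star \ge \frac{t}{r(\psi) + t/\sqrt{\psi}} \tag{A.13}
\]

from the inequality $\sqrt{1 + a} - 1 \ge a/(2 + \sqrt{a})$, which holds for $a \ge 0$. Indeed,

\[
a = \left(\sqrt{1 + a} - 1\right)\left(\sqrt{1 + a} + 1\right) \le \left(\sqrt{1 + a} - 1\right)\left(2 + \sqrt{a}\right).
\]

Introduce the estimate (A.13) into the probability inequality (A.11) to complete the proof of the tail bound (5.2).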
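The choice of the Laplace parameter can be verified numerically. The sketch below computes the positive root of $r\theta/(1 - \theta^2/\psi) = t$ from the quadratic formula, then checks that it solves the equation, that it lies between $t/(r + t/\sqrt{\psi})$ and $\sqrt{\psi}$, and that the exponent $-t\theta/2$ is at most $-t^2/(2r + 2t/\sqrt{\psi})$. The function names and the parameter grid are hypothetical choices for this illustration.

```python
import math

def theta_star(r, psi, t):
    """Positive root of r*theta / (1 - theta**2/psi) = t, via the quadratic formula."""
    a = 4.0 * t ** 2 / (r ** 2 * psi)
    return (r * psi / (2.0 * t)) * (math.sqrt(1.0 + a) - 1.0)

def check(r, psi, t):
    th = theta_star(r, psi, t)
    # th should solve the defining equation up to rounding error.
    residual = r * th / (1.0 - th ** 2 / psi) - t
    root_ok = abs(residual) <= 1e-8 * t
    # th should lie between the lower bound t/(r + t/sqrt(psi)) and sqrt(psi).
    lower = t / (r + t / math.sqrt(psi))
    range_ok = lower <= th <= math.sqrt(psi)
    # The exponent -t*th/2 should be at most -t^2 / (2r + 2t/sqrt(psi)).
    tail_ok = -t * th / 2.0 <= -t ** 2 / (2.0 * r + 2.0 * t / math.sqrt(psi))
    return root_ok and range_ok and tail_ok

results = [
    check(r, psi, t)
    for r in (0.5, 1.0, 5.0)
    for psi in (0.5, 1.0, 5.0)
    for t in (0.1, 1.0, 10.0, 50.0)
]
```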
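To establish the inequality (5.3) for the expectation of the maximum eigenvalue, we can apply Proposition 3.3 and the trace mgf bound (A.2) a second time. Indeed,

\[
\mathbb{E}\,\lambda_{\max}(X) \le \inf_{0 < \theta < \sqrt{\psi}} \left[\frac{\log d}{\theta} + \frac{1}{2} \cdot \frac{r(\psi)\,\theta}{1 - \theta^2/\psi}\right] \le \inf_{t > 0} \left[\frac{\log d}{\theta_\star(t)} + \frac{t}{2}\right],
\]

where $\theta_\star(t)$ is the function defined in (A.12). Incorporate the lower bound (A.13) to reach

\[
\mathbb{E}\,\lambda_{\max}(X) \le \inf_{t > 0} \left[\frac{r(\psi) \log d}{t} + \frac{\log d}{\sqrt{\psi}} + \frac{t}{2}\right] = \sqrt{2\, r(\psi) \log d} + \frac{\log d}{\sqrt{\psi}}.
\]

This observation completes the proof of Theorem 5.1.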
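The final infimum over $t$ is attained at $t = \sqrt{2\, r(\psi) \log d}$, by the arithmetic-geometric mean inequality. A minimal numerical sketch, with arbitrary illustrative parameter values and hypothetical function names, compares the closed-form value with a grid search:

```python
import math

def objective(t, r, psi, d):
    """The quantity r*log(d)/t + log(d)/sqrt(psi) + t/2 that is minimized over t > 0."""
    log_d = math.log(d)
    return r * log_d / t + log_d / math.sqrt(psi) + t / 2.0

def closed_form(r, psi, d):
    """Claimed value of the infimum: sqrt(2*r*log(d)) + log(d)/sqrt(psi)."""
    log_d = math.log(d)
    return math.sqrt(2.0 * r * log_d) + log_d / math.sqrt(psi)

r, psi, d = 2.0, 0.25, 100  # arbitrary illustrative values
t_opt = math.sqrt(2.0 * r * math.log(d))
at_optimum = objective(t_opt, r, psi, d)
target = closed_form(r, psi, d)
# Grid search over t in (0, 50] with spacing 0.01.
grid_min = min(objective(0.01 * i, r, psi, d) for i in range(1, 5001))
```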
Appendix B. Proof of Theorem 7.5: Matrix Rosenthal Inequality

The proof of the matrix Rosenthal inequality takes place in two steps. First, we verify that the bound (7.3) holds for psd random matrices. Then, we use this result to provide a short proof of the bound (7.4) for Hermitian random matrices. Before we start, let us remind the reader that the $L_p$ norm of a scalar random variable $Z$ is given by the expression $(\mathbb{E}|Z|^p)^{1/p}$ for each $p \ge 1$.



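B.1. A Sum of Random Psd Matrices. We begin with the moment bound (7.3) for an independent sum of random psd matrices. Introduce the quantity of interest

\[
E^2 := \left(\mathbb{E}\,\Big\|\sum\nolimits_k P_k\Big\|_{2p}^{2p}\right)^{1/2p}.
\]

We may invoke the triangle inequality for the $L_{2p}$ norm to obtain

\[
E^2 \le \left(\mathbb{E}\,\Big\|\sum\nolimits_k (P_k - \mathbb{E} P_k)\Big\|_{2p}^{2p}\right)^{1/2p} + \Big\|\sum\nolimits_k \mathbb{E} P_k\Big\|_{2p} =: \left(\mathbb{E}\,\|X\|_{2p}^{2p}\right)^{1/2p} + \mu.
\]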



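We can apply the matrix BDG inequality to control this expectation, which yields an algebraic inequality between $E^2$ and $E$. We solve this inequality to bound $E^2$.

The series $X$ consists of centered, independent random matrices, so we can use the Stein pair described in Section 2.4. According to (2.6), the conditional variance $\Delta_X$ takes the form

\[
\begin{aligned}
\Delta_X &= \frac{1}{2} \sum\nolimits_k \left[(P_k - \mathbb{E} P_k)^2 + \mathbb{E}(P_k - \mathbb{E} P_k)^2\right] \\
&\preccurlyeq \frac{1}{2} \sum\nolimits_k \left[2P_k^2 + 2(\mathbb{E} P_k)^2 + \mathbb{E} P_k^2 - (\mathbb{E} P_k)^2\right] \\
&\preccurlyeq \sum\nolimits_k \left(P_k^2 + \mathbb{E} P_k^2\right).
\end{aligned}
\]

The first inequality follows from the operator convexity (1.1) of the square function; we have computed the second expectation exactly. The last bound follows from the operator Jensen inequality (1.2). Now, the matrix BDG inequality yields

\[
\begin{aligned}
E^2 &\le \sqrt{2p - 1} \cdot \left(\mathbb{E}\,\|\Delta_X\|_p^p\right)^{1/2p} + \mu \\
&\le \sqrt{2p - 1} \cdot \left(\mathbb{E}\,\Big\|\sum\nolimits_k \left(P_k^2 + \mathbb{E} P_k^2\right)\Big\|_p^p\right)^{1/2p} + \mu \\
&\le \sqrt{4p - 2} \cdot \left(\mathbb{E}\,\Big\|\sum\nolimits_k P_k^2\Big\|_p^p\right)^{1/2p} + \mu.
\end{aligned}
\]

The third line follows from the triangle inequality for the $L_p$ norm and Jensen's inequality.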

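Next, we search for a copy of $E^2$ inside this expectation. To accomplish this goal, we want to draw a factor $P_k$ off of each term in the sum. The following result of Pisier and Xu [PX97, Lem. 2.6] has the form we desire.

Proposition B.1 (A Matrix Schwarz-Type Inequality). Consider a finite sequence $(A_k)_{k \ge 1}$ of deterministic psd matrices. For each $p \ge 1$,

\[
\Big\|\sum\nolimits_k A_k^2\Big\|_p \le \left(\sum\nolimits_k \|A_k\|_{2p}^{2p}\right)^{1/2p} \Big\|\sum\nolimits_k A_k\Big\|_{2p}.
\]

Apply the matrix Schwarz-type inequality, Proposition B.1, to reach

\[
\begin{aligned}
E^2 &\le \sqrt{4p - 2} \cdot \left[\mathbb{E} \left(\sum\nolimits_k \|P_k\|_{2p}^{2p}\right)^{1/2} \Big\|\sum\nolimits_k P_k\Big\|_{2p}^{p}\right]^{1/2p} + \mu \\
&\le \sqrt{4p - 2} \cdot \left(\sum\nolimits_k \mathbb{E}\,\|P_k\|_{2p}^{2p}\right)^{1/4p} \left(\mathbb{E}\,\Big\|\sum\nolimits_k P_k\Big\|_{2p}^{2p}\right)^{1/4p} + \mu.
\end{aligned}
\]

The second bound is the Cauchy-Schwarz inequality for expectation. Since the last expectation, raised to the power $1/4p$, equals $E$, the resulting estimate takes the form $E^2 \le cE + \mu$. Solutions of this quadratic inequality must satisfy $E \le c + \sqrt{\mu}$. We reach

\[
E \le \sqrt{4p - 2} \cdot \left(\sum\nolimits_k \mathbb{E}\,\|P_k\|_{2p}^{2p}\right)^{1/4p} + \Big\|\sum\nolimits_k \mathbb{E} P_k\Big\|_{2p}^{1/2}.
\]

Square this expression to complete the proof of (7.3).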
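The step from the quadratic inequality $E^2 \le cE + \mu$ to the bound $E \le c + \sqrt{\mu}$ rests on the fact that the largest admissible $E$ is the positive root $(c + \sqrt{c^2 + 4\mu})/2$, together with $\sqrt{c^2 + 4\mu} \le c + 2\sqrt{\mu}$. A small sketch, using a hypothetical helper name and an arbitrary grid of nonnegative values, checks this numerically:

```python
import math

def largest_solution(c, mu):
    """Largest E >= 0 with E**2 <= c*E + mu: the positive root of E**2 - c*E - mu."""
    return (c + math.sqrt(c ** 2 + 4.0 * mu)) / 2.0

values = [0.0, 0.1, 1.0, 3.5, 20.0]  # arbitrary nonnegative test values

# The positive root never exceeds c + sqrt(mu) (tolerance covers the equality cases).
ok = all(
    largest_solution(c, mu) <= c + math.sqrt(mu) + 1e-12
    for c in values
    for mu in values
)

# The root really is admissible: it satisfies the quadratic inequality up to rounding.
admissible = all(
    largest_solution(c, mu) ** 2 <= c * largest_solution(c, mu) + mu + 1e-9
    for c in values
    for mu in values
)
```

Equality holds when either $c = 0$ or $\mu = 0$, which is why a small tolerance appears in the comparison.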



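B.2. A Sum of Centered, Random Hermitian Matrices. We are now prepared to establish the bound (7.4) for a sum of centered, independent, random Hermitian matrices. Define the random matrix $X := \sum_k Y_k$. We may use the matrix Stein pair described in Section 2.4. According to (2.6), the conditional variance $\Delta_X$ takes the form

\[
\Delta_X = \frac{1}{2} \sum\nolimits_k \left(Y_k^2 + \mathbb{E} Y_k^2\right).
\]

The matrix BDG inequality, Theorem 7.1, yields

\[
\begin{aligned}
\left(\mathbb{E}\,\|X\|_{4p}^{4p}\right)^{1/4p} &\le \sqrt{4p - 1} \cdot \left(\mathbb{E}\,\|\Delta_X\|_{2p}^{2p}\right)^{1/4p} \\
&= \sqrt{4p - 1} \cdot \left(\mathbb{E}\,\Big\|\frac{1}{2} \sum\nolimits_k \left(Y_k^2 + \mathbb{E} Y_k^2\right)\Big\|_{2p}^{2p}\right)^{1/4p} \\
&\le \sqrt{4p - 1} \cdot \left(\mathbb{E}\,\Big\|\sum\nolimits_k Y_k^2\Big\|_{2p}^{2p}\right)^{1/4p}.
\end{aligned}
\]

The third line follows from the triangle inequality for the $L_{2p}$ norm and Jensen's inequality. To bound the remaining expectation, we simply note that the sum consists of independent, random psd matrices. We complete the proof by invoking the matrix Rosenthal inequality (7.3) and simplifying.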



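Appendix C. Proof of Theorem 10.3: A Combinatorial Sum of Matrices

Consider the matrix Stein pair $(X, X')$ constructed in Section 10.2. The expression (10.2) and the operator convexity (1.1) of the matrix square allow us to bound the conditional variance as follows.

\[
\begin{aligned}
\Delta_X(\pi) &= \frac{1}{4n} \sum_{j,k=1}^{n} \left[A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)}\right]^2 \\
&\preccurlyeq \frac{1}{n} \sum_{j,k=1}^{n} \left[A_{j\pi(j)}^2 + A_{k\pi(k)}^2 + A_{j\pi(k)}^2 + A_{k\pi(j)}^2\right] \\
&= 2 \sum_{j=1}^{n} A_{j\pi(j)}^2 + \frac{2}{n} \sum_{j,k=1}^{n} A_{jk}^2 = W + 4\Sigma,
\end{aligned}
\]

where

\[
W := 2\left(\sum_{j=1}^{n} A_{j\pi(j)}^2\right) - 2\Sigma \quad \text{and} \quad \Sigma := \frac{1}{n} \sum_{j,k=1}^{n} A_{jk}^2.
\]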
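In the scalar case $d = 1$, where each $A_{jk}$ is a real number, the chain of relations above reduces to elementary algebra that can be checked numerically. The sketch below draws random arrays and permutations and compares the three expressions; the function names are hypothetical, and the formula for the conditional variance is the one displayed above.

```python
import random

def conditional_variance(a, perm):
    """(1/4n) * sum over j,k of [a_{j,pi(j)} + a_{k,pi(k)} - a_{j,pi(k)} - a_{k,pi(j)}]^2."""
    n = len(a)
    total = 0.0
    for j in range(n):
        for k in range(n):
            diff = a[j][perm[j]] + a[k][perm[k]] - a[j][perm[k]] - a[k][perm[j]]
            total += diff ** 2
    return total / (4.0 * n)

def middle_expression(a, perm):
    """(1/n) * sum over j,k of the four squared terms."""
    n = len(a)
    total = 0.0
    for j in range(n):
        for k in range(n):
            total += (a[j][perm[j]] ** 2 + a[k][perm[k]] ** 2
                      + a[j][perm[k]] ** 2 + a[k][perm[j]] ** 2)
    return total / n

def w_plus_four_sigma(a, perm):
    """W + 4*Sigma with Sigma = (1/n) sum a_{jk}^2 and W = 2 sum_j a_{j,pi(j)}^2 - 2 Sigma."""
    n = len(a)
    sigma = sum(a[j][k] ** 2 for j in range(n) for k in range(n)) / n
    w = 2.0 * sum(a[j][perm[j]] ** 2 for j in range(n)) - 2.0 * sigma
    return w + 4.0 * sigma

random.seed(1)
all_ok = True
for _ in range(200):
    n = random.randint(2, 6)
    a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    perm = random.sample(range(n), n)  # a uniformly random permutation
    lhs = conditional_variance(a, perm)
    mid = middle_expression(a, perm)
    rhs = w_plus_four_sigma(a, perm)
    all_ok = all_ok and lhs <= mid + 1e-9 and abs(mid - rhs) < 1e-9
```

The scalar check does not exercise operator convexity itself, only the bookkeeping of the double sums.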
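Substitute the bound for $\Delta_X(\pi)$ into the definition (5.1) of $r(\psi)$ to see that

\[
r(\psi) := \frac{1}{\psi} \log \mathbb{E}\,\overline{\operatorname{tr}}\, e^{\psi \Delta_X(\pi)} \le \frac{1}{\psi} \log \mathbb{E}\,\overline{\operatorname{tr}}\, e^{\psi(W + 4\Sigma)} \le 4\sigma^2 + \frac{1}{\psi} \log \mathbb{E}\,\overline{\operatorname{tr}}\, e^{\psi W}. \tag{C.1}
\]

The inequalities follow from the monotonicity of the trace exponential [Pet94, Sec. 2] and the fact that $\sigma^2 = \|\Sigma\|$. Therefore, it suffices to bound the trace mgf of $W$.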

Our approach is to construct a matrix Stein pair for $W$ and to argue that the associated conditional variance $\Delta_W(\pi)$ satisfies a semidefinite bound. We may then exploit the trace mgf bounds from Lemma 4.3. Observe that $W$ and $X$ take the same form: both have mean zero and share the structure of a combinatorial sum. Therefore, we can study the behavior of $W$ using the matrix Stein pair from Section 10.2. Adapting (10.2), we see that the conditional variance of $W$ satisfies

\[
\begin{aligned}
\Delta_W(\pi) &= \frac{1}{n} \sum_{j,k=1}^{n} \left[A_{j\pi(j)}^2 + A_{k\pi(k)}^2 - A_{j\pi(k)}^2 - A_{k\pi(j)}^2\right]^2 \\
&\preccurlyeq \frac{4}{n} \sum_{j,k=1}^{n} \left[A_{j\pi(j)}^4 + A_{k\pi(k)}^4 + A_{j\pi(k)}^4 + A_{k\pi(j)}^4\right] \\
&\preccurlyeq \frac{4R^2}{n} \sum_{j,k=1}^{n} \left[A_{j\pi(j)}^2 + A_{k\pi(k)}^2 + A_{j\pi(k)}^2 + A_{k\pi(j)}^2\right].
\end{aligned}
\]

In the first line, the centering terms in $W$ cancel each other out. Then we apply the operator convexity (1.1) of the matrix square and the bound $A_{jk}^4 \preccurlyeq R^2 A_{jk}^2$. Finally, identify $W$ and $\Sigma$ to reach

\[
\Delta_W(\pi) \preccurlyeq 4R^2 (W + 4\Sigma) \preccurlyeq 4R^2 \cdot W + 16R^2\sigma^2 \cdot I. \tag{C.2}
\]
The matrix inequality (C.2) gives us access to established trace mgf bounds. Indeed,

\[
\log \mathbb{E}\,\overline{\operatorname{tr}}\, e^{\psi W} \le \frac{8R^2\sigma^2\psi^2}{1 - 4R^2\psi}
\]

as a consequence of Lemma 4.3 with parameters $c = 4R^2$ and $v = 16R^2\sigma^2$. At last, we substitute the latter bound into (C.1) to discover that

\[
r(\psi) \le 4\sigma^2 + \frac{8R^2\sigma^2\psi}{1 - 4R^2\psi}.
\]

In particular, setting $\psi = (8R^2)^{-1}$, we find that $r(\psi) \le 6\sigma^2$. Apply Theorem 5.1 to wrap up.
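The arithmetic behind the choice $\psi = (8R^2)^{-1}$ can be confirmed with exact rational arithmetic. The sketch below assumes the bound on $r(\psi)$ has the form $4\sigma^2 + 8R^2\sigma^2\psi/(1 - 4R^2\psi)$ and evaluates it at this value of $\psi$; the function name and the sample values of $R^2$ and $\sigma^2$ are hypothetical.

```python
from fractions import Fraction

def r_bound(psi, R2, sigma2):
    """Evaluate 4*sigma^2 + 8*R^2*sigma^2*psi / (1 - 4*R^2*psi), for psi < 1/(4*R^2)."""
    return 4 * sigma2 + 8 * R2 * sigma2 * psi / (1 - 4 * R2 * psi)

R2 = Fraction(9, 4)      # illustrative value of R^2
sigma2 = Fraction(7, 3)  # illustrative value of sigma^2
psi = 1 / (8 * R2)       # the choice psi = (8 R^2)^{-1}

value = r_bound(psi, R2, sigma2)
matches_six_sigma2 = (value == 6 * sigma2)
```

With this choice, $1/\sqrt{\psi} = 2\sqrt{2}\,R$, which is the scale that enters Theorem 5.1 alongside $r(\psi) \le 6\sigma^2$.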



